evalwire

evalwire is a Python package for systematic evaluation of any async callable — including LangGraph nodes, plain functions, REST API endpoints, and other LLM frameworks — using Arize Phoenix experiments.

Features

  • Upload CSV testsets to Phoenix as named datasets
  • Run experiments against any async callable with pluggable evaluators
  • 12 built-in evaluator factories covering retrieval, classification, string matching, structured output, numeric, LLM-as-a-judge, and evaluator composition
  • Export experiment results to CSV or JSON, compare runs, and generate markdown reports
  • Validate testsets before upload to catch structural and content issues early
  • First-class LangGraph integration via the optional evalwire[langgraph] extra
  • OpenTelemetry tracing via observability.py
  • Config-file driven via evalwire.toml
  • CLI: evalwire upload, evalwire run, evalwire validate, evalwire export, evalwire compare, evalwire report
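
To make "any async callable" concrete, here is a minimal sketch of the kind of task that could be evaluated. The names `classify` and `run_all` are hypothetical and are not part of evalwire; how evalwire itself invokes the callable and passes testset rows is not shown here, so the runner below is only a plain-asyncio stand-in.

```python
import asyncio
from typing import Awaitable, Callable, Sequence


async def classify(question: str) -> str:
    """Hypothetical task under evaluation: route a question to a label."""
    await asyncio.sleep(0)  # stands in for a model, graph node, or HTTP call
    return "billing" if "invoice" in question.lower() else "general"


async def run_all(
    task: Callable[[str], Awaitable[str]], inputs: Sequence[str]
) -> list[str]:
    """Run the task concurrently over all inputs, preserving input order."""
    return list(await asyncio.gather(*(task(x) for x in inputs)))


questions = ["Where is my invoice?", "What are your opening hours?"]
results = asyncio.run(run_all(classify, questions))
```

Anything that can be awaited with an input and returns an output fits this shape, which is why a LangGraph node, a plain function wrapped in `async def`, or a REST call can all serve as the task.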

Built-in evaluators

| Factory | Returns | Use case |
| --- | --- | --- |
| `make_top_k_evaluator` | float | Position-weighted retrieval scoring |
| `make_membership_evaluator` | bool | Classification / routing label check |
| `make_exact_match_evaluator` | bool | Extractive QA, single ground-truth string |
| `make_contains_evaluator` | bool | Free-text generation, required phrase |
| `make_regex_evaluator` | bool | Structured format validation (dates, IDs, …) |
| `make_json_match_evaluator` | float | Tool-call / structured-output key matching |
| `make_schema_evaluator` | bool | JSON Schema conformance |
| `make_numeric_tolerance_evaluator` | bool | Math / calculation tasks with tolerance |
| `make_llm_judge_evaluator` | float \| bool | LLM-as-a-judge with structured output |
| `make_weighted_evaluator` | float | Weighted average of multiple evaluators |
| `make_all_pass_evaluator` | bool | AND-composition: all evaluators must pass |
| `make_any_pass_evaluator` | bool | OR-composition: at least one evaluator must pass |
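
The ideas behind a few of these rows can be sketched in plain Python. This is an illustration of the concepts only, not evalwire's implementation: every function name below is hypothetical, the factories' real signatures are not shown in this document, and reciprocal rank is just one assumed way to weight by position.

```python
import math
from typing import Callable, Sequence

# Assumed evaluator shape for this sketch: (output, expected) -> bool
Evaluator = Callable[[object, object], bool]


def reciprocal_rank_at_k(retrieved: Sequence[str], expected: str, k: int = 5) -> float:
    """Position-weighted score: 1/rank if `expected` is in the top k, else 0.0."""
    for rank, item in enumerate(retrieved[:k], start=1):
        if item == expected:
            return 1.0 / rank
    return 0.0


def within_tolerance(output: float, expected: float, abs_tol: float = 1e-6) -> bool:
    """Numeric check: pass when the absolute difference is at most abs_tol."""
    return math.isclose(output, expected, rel_tol=0.0, abs_tol=abs_tol)


def all_pass(*evaluators: Evaluator) -> Evaluator:
    """AND-composition: the combined evaluator passes only if every part passes."""
    def combined(output: object, expected: object) -> bool:
        return all(ev(output, expected) for ev in evaluators)
    return combined


def any_pass(*evaluators: Evaluator) -> Evaluator:
    """OR-composition: the combined evaluator passes if at least one part passes."""
    def combined(output: object, expected: object) -> bool:
        return any(ev(output, expected) for ev in evaluators)
    return combined


# Two simple string checks to compose.
def exact(output: object, expected: object) -> bool:
    return output == expected


def contains(output: object, expected: object) -> bool:
    return str(expected) in str(output)


strict = all_pass(exact, contains)
lenient = any_pass(exact, contains)
```

With these definitions, `lenient("The answer is Paris.", "Paris")` passes because the phrase is contained in the output, while `strict` fails on the same pair because the strings are not identical.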