evalwire
evalwire is a Python package for systematically evaluating any async callable (LangGraph nodes, plain functions, REST API endpoints, or callables from other LLM frameworks) using Arize Phoenix experiments.
Features
- Upload CSV testsets to Phoenix as named datasets
- Run experiments against any async callable with pluggable evaluators
- 12 built-in evaluator factories covering retrieval, classification, string matching, structured output, numeric, LLM-as-a-judge, and evaluator composition
- Export experiment results to CSV or JSON, compare runs, and generate markdown reports
- Validate testsets before upload to catch structural and content issues early
- First-class LangGraph integration via the optional `evalwire[langgraph]` extra
- OpenTelemetry tracing via `observability.py`
- Config-file driven via `evalwire.toml`
- CLI: `evalwire upload`, `evalwire run`, `evalwire validate`, `evalwire export`, `evalwire compare`, `evalwire report`
Built-in evaluators
| Factory | Returns | Use case |
|---|---|---|
| `make_top_k_evaluator` | `float` | Position-weighted retrieval scoring |
| `make_membership_evaluator` | `bool` | Classification / routing label check |
| `make_exact_match_evaluator` | `bool` | Extractive QA, single ground-truth string |
| `make_contains_evaluator` | `bool` | Free-text generation, required phrase |
| `make_regex_evaluator` | `bool` | Structured format validation (dates, IDs, …) |
| `make_json_match_evaluator` | `float` | Tool-call / structured-output key matching |
| `make_schema_evaluator` | `bool` | JSON Schema conformance |
| `make_numeric_tolerance_evaluator` | `bool` | Math / calculation tasks with tolerance |
| `make_llm_judge_evaluator` | `float \| bool` | LLM-as-a-judge with structured output |
| `make_weighted_evaluator` | `float` | Weighted average of multiple evaluators |
| `make_all_pass_evaluator` | `bool` | AND-composition: all evaluators must pass |
| `make_any_pass_evaluator` | `bool` | OR-composition: at least one evaluator must pass |
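
To make the factory and composition ideas above concrete, here is a minimal, standalone sketch of the pattern in plain Python. It does not use evalwire: every name (`contains_sketch`, `exact_match_sketch`, `all_pass_sketch`, `weighted_sketch`) and the `(output, expected)` evaluator signature are assumptions made for illustration, and the real factories may take different arguments and follow a different contract. See the API Reference for the actual signatures.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical evaluator shape for this sketch: (output, expected) -> bool.
# evalwire's real evaluator contract may differ.
Evaluator = Callable[[str, str], bool]


def contains_sketch(case_sensitive: bool = False) -> Evaluator:
    """Build an evaluator that passes when `expected` occurs in `output`."""

    def evaluate(output: str, expected: str) -> bool:
        if not case_sensitive:
            output, expected = output.lower(), expected.lower()
        return expected in output

    return evaluate


def exact_match_sketch() -> Evaluator:
    """Build an evaluator that passes only on an exact string match."""

    def evaluate(output: str, expected: str) -> bool:
        return output == expected

    return evaluate


def all_pass_sketch(evaluators: Sequence[Evaluator]) -> Evaluator:
    """AND-composition: pass only if every wrapped evaluator passes."""

    def evaluate(output: str, expected: str) -> bool:
        return all(ev(output, expected) for ev in evaluators)

    return evaluate


def weighted_sketch(
    pairs: Sequence[Tuple[Evaluator, float]],
) -> Callable[[str, str], float]:
    """Weighted average of boolean evaluators, as a score between 0 and 1."""
    total = sum(weight for _, weight in pairs)
    if total <= 0:
        raise ValueError("weights must sum to a positive number")

    def evaluate(output: str, expected: str) -> float:
        earned = sum(weight * float(ev(output, expected)) for ev, weight in pairs)
        return earned / total

    return evaluate


# Usage: the phrase check passes, the exact match does not.
contains = contains_sketch()
exact = exact_match_sketch()
strict = all_pass_sketch([contains, exact])
blended = weighted_sketch([(contains, 3.0), (exact, 1.0)])

answer, truth = "The capital is Paris.", "Paris"
print(contains(answer, truth), strict(answer, truth), blended(answer, truth))
```

Each factory returns a closure, so configuration such as case sensitivity or weights is fixed once at construction time and the resulting evaluator can be reused across every example in a dataset.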
Navigation
- Quick Start: get up and running in minutes
- Concepts: understand datasets, experiments, tasks, and evaluators
- Guides: Writing Custom Evaluators: evaluator contract, patterns, and best practices
- Configuration: full `evalwire.toml` reference
- Troubleshooting: common errors and fixes
- API Reference: full module documentation
- Changelog: version history