Skip to content

Quick Start

This guide walks you through uploading a testset to Arize Phoenix and running your first experiment with evalwire.

Prerequisites

  • Python >= 3.10
  • A running Arize Phoenix instance (see Phoenix docs for local setup via Docker)
  • Your Phoenix endpoint exported as PHOENIX_BASE_URL (default: http://localhost:6006)

Install

pip install 'evalwire[langgraph]'

Step 1 — Prepare your CSV testset

Create a CSV file with at minimum a tags column, an input column, and an expected-output column:

user_query,expected_output,tags
"What is a large language model?","A deep learning model trained on text.","rag_pipeline"
"How does retrieval augmented generation work?","RAG retrieves context before generating.","rag_pipeline"
  • tags — names the Phoenix dataset the row belongs to. Pipe-delimit to assign a row to multiple datasets: es_search|source_router.

Step 2 — Upload to Phoenix

evalwire upload --csv data/testset.csv

Options:

Flag Default Description
--on-exist skip skip Leave existing datasets untouched
--on-exist overwrite Delete and re-create
--on-exist append Add rows to existing dataset
--input-keys COL user_query Comma-separated input column names
--output-keys COL expected_output Comma-separated output column names

Step 3 — Write a task

Create experiments/rag_pipeline/task.py:

from evalwire.langgraph import invoke_node
from agent.graph import RAGState, retrieve

async def task(example) -> list[str]:
    result = await invoke_node(retrieve, example.input["user_query"], RAGState)
    return result["retrieved_titles"]

Step 4 — Choose an evaluator

You can write a custom evaluator function or use one of the built-in factories.

Using a built-in evaluator

Create experiments/rag_pipeline/top_k.py:

from evalwire.evaluators import make_top_k_evaluator

top_k = make_top_k_evaluator(K=5)

All nine built-in factories are available from evalwire.evaluators:

Factory Returns When to use
make_top_k_evaluator(K) float Ranked retrieval — score by position
make_membership_evaluator() bool Classification / routing label
make_exact_match_evaluator() bool Single correct string answer
make_contains_evaluator() bool Output must include a required phrase
make_regex_evaluator() bool Output must match a regex pattern
make_json_match_evaluator(keys) float Structured output key-value matching
make_schema_evaluator(schema) bool JSON Schema conformance
make_numeric_tolerance_evaluator(atol, rtol) bool Numeric answer within tolerance
make_llm_judge_evaluator(model, prompt, schema) float\|bool LLM-as-a-judge

Writing a custom evaluator

def top_k(output: list[str], expected: dict) -> float:
    """Fraction of expected titles present in the top-K retrieved results."""
    expected_titles = {t.strip() for t in expected.get("expected_output", "").split("|") if t.strip()}
    if not expected_titles:
        return 0.0
    hits = sum(1 for t in output if t in expected_titles)
    return hits / len(expected_titles)

Using the LLM judge

from pydantic import BaseModel
from langchain.chat_models import init_chat_model
from evalwire.evaluators import make_llm_judge_evaluator

class Verdict(BaseModel):
    explanation: str
    score: bool  # True = correct

llm_judge = make_llm_judge_evaluator(
    model=init_chat_model("gpt-4o-mini"),
    prompt_template=(
        "Output: {output}\n"
        "Expected: {expected_output}\n"
        "Is the output correct? Think step by step, then set score."
    ),
    output_schema=Verdict,
)

Requires pip install 'evalwire[llm-judge]'.

Step 5 — Run experiments

evalwire run --experiments experiments/

Results appear in the Phoenix UI under the Experiments tab for each dataset.

Using a config file

Avoid repeating flags by creating evalwire.toml:

[dataset]
csv_path = "data/testset.csv"
on_exist = "skip"

[experiments]
dir = "experiments"
prefix = "eval"
concurrency = 4

Then simply run:

evalwire upload
evalwire run