evalwire.evaluators¶
Built-in evaluator factories. Each factory returns a callable with the standard evalwire evaluator signature evaluator(output, expected) -> bool | float, where output is the model (or retrieval) output and expected is a dict of ground-truth data.
The expected dict always contains at minimum an "expected_output" key whose
value is parsed by the shared _parse_expected helper (handles plain strings,
Python-literal strings such as "['a','b']", and lists).
All factories are importable directly from evalwire.evaluators:
from evalwire.evaluators import (
make_top_k_evaluator,
make_membership_evaluator,
make_exact_match_evaluator,
make_contains_evaluator,
make_regex_evaluator,
make_json_match_evaluator,
make_schema_evaluator,
make_numeric_tolerance_evaluator,
make_llm_judge_evaluator,
)
Retrieval¶
evalwire.evaluators.top_k.make_top_k_evaluator(K=20)
¶
Return a position-weighted retrieval scoring evaluator.
The returned callable scores a ranked list output against expected items.
Algorithm
score_per_item = 1.0 - (position / K) if item found in output[:K], else 0.0
final_score = mean(score_per_item for item in expected_output)
Parameters¶
K: Window size. Items found beyond position K-1 score 0.0.
Returns¶
Callable[[list[str], dict], float]
Evaluator with signature top_k(output, expected) -> float.
output is a list of strings ordered by relevance (most relevant first).
expected is a dict with key "expected_output" containing a
list[str] or a str parseable by ast.literal_eval.
Source code in src/evalwire/evaluators/top_k.py
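The scoring rule above can be sketched in a few lines. This is an illustrative re-implementation of the documented algorithm, not the library source; it assumes expected_output has already been parsed into a list (the real factory also accepts Python-literal strings via _parse_expected):

```python
from statistics import mean

def make_top_k_evaluator(K=20):
    """Sketch of the documented position-weighted scoring rule."""
    def top_k(output, expected):
        items = expected.get("expected_output", [])
        if not items:
            return 0.0
        window = output[:K]  # only the first K positions can score

        def item_score(item):
            if item in window:
                return 1.0 - window.index(item) / K  # earlier position, higher score
            return 0.0  # missing or beyond position K-1

        return mean(item_score(item) for item in items)
    return top_k
```

With K=4, an item at position 0 scores 1.0 and one at position 2 scores 0.5, so expecting both yields a mean of 0.75.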
Classification¶
evalwire.evaluators.membership.make_membership_evaluator()
¶
Return an exact-membership check evaluator.
Designed for classification/routing outputs where the expected value is one of a small set of labels.
Returns¶
Callable[[str, dict], bool]
Evaluator with signature is_in(output, expected) -> bool.
output is the predicted label string.
expected is a dict with key "expected_output" containing a
list[str] or a str parseable by ast.literal_eval.
Returns True if output is in the expected list.
Source code in src/evalwire/evaluators/membership.py
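The membership check reduces to a single in test. A minimal sketch of the documented behavior (again assuming expected_output is already a list):

```python
def make_membership_evaluator():
    """Sketch: exact membership of the predicted label in the expected set."""
    def is_in(output, expected):
        # True only when the predicted label appears in the expected list
        return output in expected.get("expected_output", [])
    return is_in
```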
String matching¶
evalwire.evaluators.exact_match.make_exact_match_evaluator()
¶
Return a strict string-equality evaluator.
Compares the model output against a single ground-truth string stored in
expected["expected_output"]. Useful for extractive QA and any task
where exactly one correct answer exists.
Returns¶
Callable[[str, dict], bool]
Evaluator with signature exact_match(output, expected) -> bool.
output is the model-generated string.
expected is a dict with key "expected_output" containing a
single string (or a single-element list/literal whose first element
is the ground truth).
Returns True only when output equals the first expected item
character-for-character (case-sensitive).
Returns False when output is None, the key is absent, or
the expected list is empty.
Source code in src/evalwire/evaluators/exact_match.py
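The edge-case behavior listed above (None output, absent key, empty list) can be sketched as follows; this is illustrative only, and normalizes a bare string where the real factory uses _parse_expected:

```python
def make_exact_match_evaluator():
    """Sketch of strict, case-sensitive string equality against the first expected item."""
    def exact_match(output, expected):
        items = expected.get("expected_output")
        if isinstance(items, str):
            items = [items]  # treat a bare string as a single-element list
        if output is None or not items:
            return False  # documented edge cases all yield False
        return output == items[0]  # character-for-character, case-sensitive
    return exact_match
```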
evalwire.evaluators.contains.make_contains_evaluator()
¶
Return a substring-containment evaluator.
Checks whether the first value in expected["expected_output"] appears
as a substring of output. Useful for free-text generation tasks where
the answer must include a specific phrase or keyword.
To test the reverse (output is a substring of the expected string), wrap
the result with not:
contains = make_contains_evaluator()
inverted = lambda out, exp: not contains(out, exp)
Returns¶
Callable[[str, dict], bool]
Evaluator with signature contains(output, expected) -> bool.
output is the model-generated string.
expected is a dict with key "expected_output" whose first item
is the substring that must appear in output.
Returns False when output is None, the key is absent, or
the expected list is empty.
Source code in src/evalwire/evaluators/contains.py
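A sketch of the containment check and the documented not inversion (illustrative, not the library source):

```python
def make_contains_evaluator():
    """Sketch: first expected item must appear as a substring of the output."""
    def contains(output, expected):
        items = expected.get("expected_output")
        if isinstance(items, str):
            items = [items]
        if output is None or not items:
            return False  # documented edge cases
        return items[0] in output
    return contains

contains = make_contains_evaluator()
# Reverse direction, as described above:
inverted = lambda out, exp: not contains(out, exp)
```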
evalwire.evaluators.regex.make_regex_evaluator()
¶
Return a regular-expression match evaluator.
Treats the first value of expected["expected_output"] as a regex
pattern and applies re.search against output. Useful for
validating structured outputs such as dates, identifiers, URLs, or code
snippets.
The pattern is compiled at call time so that an invalid regex raises
re.error immediately, giving the user a clear signal.
Returns¶
Callable[[str, dict], bool]
Evaluator with signature regex_match(output, expected) -> bool.
output is the string to match against.
expected is a dict with key "expected_output" containing the
regex pattern string.
Returns False when output is None, the pattern is empty,
or the key is absent.
Raises re.error if the pattern is syntactically invalid.
Source code in src/evalwire/evaluators/regex.py
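A sketch of the described behavior, compiling at call time so a bad pattern fails loudly (illustrative, not the library source):

```python
import re

def make_regex_evaluator():
    """Sketch: search the output with the expected regex pattern."""
    def regex_match(output, expected):
        items = expected.get("expected_output")
        if isinstance(items, str):
            items = [items]
        if output is None or not items or not items[0]:
            return False  # missing output, key, or empty pattern
        pattern = re.compile(items[0])  # raises re.error on an invalid pattern
        return pattern.search(output) is not None
    return regex_match
```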
Structured output¶
evalwire.evaluators.json_match.make_json_match_evaluator(keys=None)
¶
Return a partial JSON key-value matching evaluator.
Parses output as a JSON object and compares specific key-value pairs
against an expected JSON object stored in expected["expected_output"].
Useful for evaluating tool-call outputs, structured generation, and API
response validation.
Parameters¶
keys:
An optional list of key names to check. When provided, only those
keys are compared; keys present in the expected object but absent from
this list are ignored. When None (default), all keys present in
the expected object are checked.
Returns¶
Callable[[str, dict], float]
Evaluator with signature json_match(output, expected) -> float.
output is a JSON string representing an object.
expected is a dict with key "expected_output" containing a
JSON string (or a Python-literal string) that represents the
ground-truth object.
Score is the fraction of checked keys whose values match exactly:
n_matching / n_checked.
Returns 0.0 when output is not valid JSON, when the
expected value is empty or not a JSON object, or when no keys are
checked.
Source code in src/evalwire/evaluators/json_match.py
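The n_matching / n_checked scoring can be sketched as below. This handles only JSON-string expected values; the real factory also accepts Python-literal strings:

```python
import json

def make_json_match_evaluator(keys=None):
    """Sketch: score = fraction of checked keys whose values match exactly."""
    def json_match(output, expected):
        try:
            got = json.loads(output)
            want = json.loads(expected["expected_output"])
        except (json.JSONDecodeError, KeyError, TypeError):
            return 0.0  # invalid JSON or missing key
        if not isinstance(got, dict) or not isinstance(want, dict) or not want:
            return 0.0  # expected value empty or not a JSON object
        # When keys is given, only those keys are compared
        checked = [k for k in want if keys is None or k in keys]
        if not checked:
            return 0.0
        matching = sum(1 for k in checked if k in got and got[k] == want[k])
        return matching / len(checked)
    return json_match
```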
evalwire.evaluators.schema.make_schema_evaluator(schema)
¶
Return a JSON Schema validation evaluator.
Parses output as JSON and validates it against the provided JSON
Schema dict using jsonschema. Useful for asserting that LLM outputs
conform to a declared schema regardless of the specific values produced.
The JSON schema is bound at factory-creation time so the same validator can be reused across many evaluation rows without re-compiling.
Parameters¶
schema: A JSON Schema dict (Draft 7 / Draft 2020-12) describing the expected structure of the output.
Returns¶
Callable[[str, dict], bool]
Evaluator with signature schema_valid(output, expected) -> bool.
output is the JSON string to validate.
expected is not used at evaluation time (the schema is fixed at
factory-creation time) but follows the standard evaluator contract.
Returns True when output is valid JSON that satisfies
schema, False otherwise.
Raises¶
ImportError
If jsonschema is not installed. Install it with:
pip install 'jsonschema>=4.0'
Source code in src/evalwire/evaluators/schema.py
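The bind-once-reuse-many design described above can be sketched with jsonschema's validator classes (illustrative; the library's actual implementation may differ):

```python
import json

def make_schema_evaluator(schema):
    """Sketch: the schema is bound once so the validator is reused across rows."""
    try:
        from jsonschema import Draft202012Validator
    except ImportError as exc:  # optional dependency
        raise ImportError("pip install 'jsonschema>=4.0'") from exc
    validator = Draft202012Validator(schema)  # compiled once, at factory time

    def schema_valid(output, expected):
        # `expected` is unused: the schema was fixed at factory-creation time
        try:
            instance = json.loads(output)
        except (json.JSONDecodeError, TypeError):
            return False
        return validator.is_valid(instance)

    return schema_valid
```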
Numeric¶
evalwire.evaluators.numeric_tolerance.make_numeric_tolerance_evaluator(atol=1e-06, rtol=0.0)
¶
Return a numeric proximity evaluator.
Checks whether a numeric model output is within an absolute and/or
relative tolerance of the expected value. Mirrors the semantics of
math.isclose:
|output - expected| <= atol + rtol * |expected|
Useful for math-reasoning, unit-conversion, and calculation agent tasks.
Parameters¶
atol:
Absolute tolerance (default 1e-6).
rtol:
Relative tolerance as a fraction of the expected value
(default 0.0). Set to e.g. 0.01 for a 1 % tolerance.
Returns¶
Callable[[str | float, dict], bool]
Evaluator with signature numeric_close(output, expected) -> bool.
output may be a numeric string or a float/int.
expected is a dict with key "expected_output" containing a
numeric string or a single-element list with a numeric string.
Returns False when either value cannot be converted to float,
when expected is empty, or when the key is missing.
Source code in src/evalwire/evaluators/numeric_tolerance.py
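The tolerance check above can be sketched directly from the formula (illustrative, not the library source):

```python
def make_numeric_tolerance_evaluator(atol=1e-06, rtol=0.0):
    """Sketch: |output - expected| <= atol + rtol * |expected|."""
    def numeric_close(output, expected):
        items = expected.get("expected_output")
        if isinstance(items, (str, int, float)):
            items = [items]  # normalize a bare value to a single-element list
        if output is None or not items:
            return False
        try:
            got = float(output)     # accepts numeric strings or floats/ints
            want = float(items[0])
        except (TypeError, ValueError):
            return False  # not convertible to float
        return abs(got - want) <= atol + rtol * abs(want)
    return numeric_close
```

With rtol=0.01 and an expected value of 100, any output within 1.0 of 100 passes.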
LLM judge¶
evalwire.evaluators.llm_judge.make_llm_judge_evaluator(model, prompt_template, output_schema, *, result_key='score', on_error='silent', error_callback=None)
¶
Return an LLM-as-a-judge evaluator backed by a LangChain chat model.
Uses a user-supplied LangChain BaseChatModel with structured output
(via langchain_core.language_models.BaseChatModel.with_structured_output)
to evaluate free-text model outputs against an expected value. The judge
model, evaluation prompt, and output schema are all provided by the caller,
making this evaluator fully flexible across task types (binary pass/fail,
1-5 rating, open rubrics, etc.).
Following Arize best-practice guidance, the prompt template should:
- State the evaluation criteria explicitly.
- Request chain-of-thought reasoning before the final score/verdict, so that
the explanation field precedes the result_key field in the schema.
- Use {output} and {expected_output} as placeholders for the model output
and the ground-truth value respectively.
Example usage:
from langchain.chat_models import init_chat_model
from pydantic import BaseModel
class Verdict(BaseModel):
explanation: str
score: bool # True = correct, False = incorrect
judge = make_llm_judge_evaluator(
model=init_chat_model("gpt-4o-mini"),
prompt_template=(
"You are an expert evaluator.\n"
"Question answer: {output}\n"
"Expected answer: {expected_output}\n"
"Is the answer factually correct? "
"Think step by step, then set score to true or false."
),
output_schema=Verdict,
)
Parameters¶
model:
A LangChain BaseChatModel instance (e.g. obtained via
langchain.chat_models.init_chat_model).
prompt_template:
A string containing {output} and optionally
{expected_output} placeholders that will be formatted at
evaluation time.
output_schema:
A Pydantic BaseModel subclass. The factory calls
model.with_structured_output(output_schema) once to bind the
structured-output chain.
result_key:
Name of the field on the schema instance whose value is returned
as the final score. Defaults to "score". The return type of
the evaluator is inferred from the field's type annotation
(bool → returns bool; anything else → returns float).
on_error:
Behaviour when the LLM call or result extraction raises an
exception:
* "silent" (default): swallow the exception and return the
zero-value for the inferred type (False for bool,
0.0 otherwise).
* "reraise": call error_callback(exc) then re-raise.
error_callback is required when this option is chosen.
error_callback:
A callable that receives the exception before it is re-raised.
Required when on_error="reraise", ignored otherwise.
Returns¶
Callable[[str, dict], float | bool]
Evaluator with signature llm_judge(output, expected) -> score.
Raises¶
ValueError
If on_error="reraise" is selected without supplying an
error_callback.
ImportError
If langchain-core is not installed. Install it with:
pip install 'evalwire[llm-judge]'
Source code in src/evalwire/evaluators/llm_judge.py