Write a custom evaluator

A custom evaluator is plain Python that scores a response. No model, no judge prompt: you write a function, return a value, and Middleware attaches the result to the trace span.

Use one whenever correctness is objective and you can check it in code:

Non-empty or well-formed output (valid JSON, required keys)
Regex or exact match against an expected answer
Length or format constraints
Similarity to a reference above a threshold

These checks run faster than an LLM-as-a-judge, cost nothing per call, and don't drift. Reach for a judge only when the check needs actual judgment.
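As a rough illustration, the first two checks in the list above fit in a few lines of standard-library Python. The names `has_required_keys` and `matches_expected` are hypothetical and not part of Middleware; they only sketch the kind of logic you would later wrap in an evaluator.

```python
import json
import re


def has_required_keys(output: str, required: set[str]) -> bool:
    """Return True if output is a JSON object containing every required key."""
    try:
        parsed = json.loads(output)
    except (TypeError, ValueError):
        # Not a string, or not valid JSON (JSONDecodeError subclasses ValueError).
        return False
    return isinstance(parsed, dict) and required.issubset(parsed.keys())


def matches_expected(output: str, pattern: str) -> bool:
    """Return True if the whole (stripped) output matches the regular expression."""
    return re.fullmatch(pattern, output.strip()) is not None
```

Both functions are deterministic and need no network call, which is what makes this kind of check cheap to run on every response.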

How it works

You give the evaluator an EvaluatorContext (the input, output, and any expected output), it returns a value, and evaluate_and_submit records the result on the active span:

EvaluatorContext (input, output, expected_output)
        │
        ▼
your evaluator function  ──►  value + optional pass/fail
        │
        ▼
attached to the active trace span in Middleware

The return type sets the metric type automatically:

| Return | Metric type |
| --- | --- |
| `bool` | boolean (pass/fail) |
| `int` / `float` | score |
| `str` | categorical |
| `None` | skipped |
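One way to picture the mapping in the table above is the sketch below. It is an illustration of the rule, not Middleware's implementation: `metric_type_for` is a hypothetical name, and raising `TypeError` for other types is an assumption. The ordering matters because `bool` is a subclass of `int` in Python.

```python
from typing import Optional, Union

Value = Union[bool, int, float, str, None]


def metric_type_for(value: Value) -> Optional[str]:
    """Map a returned value to the metric type listed in the table."""
    if value is None:
        return None  # skipped: nothing is recorded
    # bool must be tested before int, because bool is a subclass of int.
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, (int, float)):
        return "score"
    if isinstance(value, str):
        return "categorical"
    # Assumption: anything else is treated as an error in this sketch.
    raise TypeError(f"unsupported evaluator return type: {type(value).__name__}")
```

In practice this means a check that returns `1` is recorded as a score, while one that returns `True` is recorded as pass/fail, so return the type that matches the metric you want.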

What you'll need

pip install middleware-llmobs openinference-instrumentation-openai openai

Export your Middleware endpoint, key, and OPENAI_API_KEY as described in Trace an LLM application.

A quick function check

Use the @evaluator decorator for a one-off check. It captures exceptions into the result and normalizes the return value, so a thrown error becomes a failed eval instead of crashing your app.

This example asks the model for a short summary, then checks that the actual response stays within a word budget. That is a deterministic rule, so an LLM judge would be overkill. Both the model call and the check run inside the summarize chain span; evaluate_and_submit binds to the active span, so the result attaches to it:

from openai import OpenAI
from middleware.llmobs import (
    register, evaluator, EvaluatorContext, EvaluatorResult,
    evaluate_and_submit, flush_evaluations,
)

providers = register(service_name="eval-example", auto_instrument=True)
tracer = providers.tracer.get_tracer(__name__)
client = OpenAI()

MAX_WORDS = 30

@evaluator(name="within_word_limit")
def within_word_limit(ctx: EvaluatorContext) -> EvaluatorResult:
    """Pass if the response is at most MAX_WORDS words."""
    words = len(ctx.output.split())
    return EvaluatorResult(
        value=words <= MAX_WORDS,
        assessment="pass" if words <= MAX_WORDS else "fail",
        reasoning=f"{words} words (limit {MAX_WORDS}).",
    )

@tracer.chain
def summarize(text: str) -> str:
    answer = client.chat.completions.create(   # auto-instrumented LLM span, nested here
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": f"Summarize the text in {MAX_WORDS} words or fewer."},
            {"role": "user", "content": text},
        ],
    ).choices[0].message.content

    # Active span is the chain span, so the eval attaches to it.
    evaluate_and_submit(within_word_limit, EvaluatorContext(input=text, output=answer))
    return answer

summarize("OpenTelemetry is an observability framework for generating, collecting, and exporting telemetry data such as traces, metrics, and logs.")
flush_evaluations()
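To make the exception-capturing behavior of @evaluator more concrete, here is a minimal standard-library sketch of the general technique. It is not Middleware's code: `SketchResult` and `capture_errors` are hypothetical stand-ins, and the exact shape of a failed result in the real library is an assumption.

```python
import functools
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass
class SketchResult:
    """Hypothetical stand-in for a result object; not Middleware's EvaluatorResult."""
    value: Any
    assessment: Optional[str] = None
    reasoning: Optional[str] = None


def capture_errors(fn: Callable[..., Any]) -> Callable[..., SketchResult]:
    """Wrap a check so exceptions become failed results instead of propagating."""
    @functools.wraps(fn)
    def wrapper(*args: Any, **kwargs: Any) -> SketchResult:
        try:
            raw = fn(*args, **kwargs)
        except Exception as exc:  # deliberately broad: the app must keep running
            return SketchResult(value=None, assessment="fail",
                                reasoning=f"{type(exc).__name__}: {exc}")
        # Normalize bare return values into a result object.
        return raw if isinstance(raw, SketchResult) else SketchResult(value=raw)
    return wrapper


@capture_errors
def word_count_ok(output: str) -> bool:
    return len(output.split()) <= 30


ok = word_count_ok("a short summary")
failed = word_count_ok(None)  # None has no .split(), so the check raises inside
```

Here `ok.value` is `True`, while `failed` carries a "fail" assessment and the error text instead of raising. The same situation can arise in the summary example if the model returns no content, which is one reason a wrapper like this is useful.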

A stateful evaluator with a threshold

When an evaluator needs configuration or state (a threshold, a client, a compiled pattern), subclass BaseEvaluator and implement evaluate(). Return an EvaluatorResult to set the value and a pass/fail assessment:

from difflib import SequenceMatcher
from middleware.llmobs import BaseEvaluator, EvaluatorResult, EvaluatorContext, evaluate_and_submit

class SimilarityEvaluator(BaseEvaluator):
    def __init__(self, threshold=0.8):
        super().__init__(name="similarity")
        self.threshold = threshold

    def evaluate(self, ctx: EvaluatorContext):
        score = SequenceMatcher(None, ctx.output, ctx.expected_output).ratio()
        return EvaluatorResult(
            value=round(score, 2),
            assessment="pass" if score >= self.threshold else "fail",
            reasoning=f"Similarity {score:.2f} vs threshold {self.threshold}.",
        )

similarity = SimilarityEvaluator(threshold=0.85)
evaluate_and_submit(
    similarity,
    EvaluatorContext(output="Paris", expected_output="Paris"),
)
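To get a feel for how the threshold behaves, the scoring logic can be tried on its own with only the standard library. The sketch below is an assumption-laden illustration: `similarity_verdict` is a hypothetical helper, and returning `None` when there is no reference is a choice made here, not documented Middleware behavior. Note that `SequenceMatcher` compares characters and is case-sensitive.

```python
from difflib import SequenceMatcher
from typing import Optional, Tuple


def similarity_verdict(output: str, expected: Optional[str],
                       threshold: float = 0.85) -> Optional[Tuple[float, str]]:
    """Return (score, "pass"/"fail"), or None when there is no reference."""
    if expected is None:
        return None  # nothing to compare against
    score = SequenceMatcher(None, output, expected).ratio()
    return (round(score, 2), "pass" if score >= threshold else "fail")


print(similarity_verdict("Paris", "Paris"))  # (1.0, 'pass')
print(similarity_verdict("Paris", "paris"))  # (0.8, 'fail')
print(similarity_verdict("Paris", None))     # None
```

A single changed character in a five-character answer already drops the ratio to 0.8, so for short answers you may want to normalize case and whitespace before comparing, or use an exact match instead.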

What you get

The evaluation attaches to the summarize span and appears on its Evaluation tab in LLM Traces, with the value and the pass/fail verdict.

[Screenshot: the summarize trace with its Evaluation tab open, showing the within_word_limit result True and the reasoning "20 words (limit 30)."]

It's also exported as metrics (gen_ai.evaluations.*), so a custom check shows up on dashboards exactly like a server-side or LLM-judge eval.

evaluate() may run concurrently across threads or tasks, so don't mutate instance attributes inside it; keep per-call state in local variables. For async def evaluators, use @async_evaluator (or AsyncBaseEvaluator) with await aevaluate_and_submit(...).
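The concurrency note above can be sketched with a plain class and a thread pool. `LengthEvaluator` is a hypothetical stand-in that does not subclass BaseEvaluator, so the snippet runs without Middleware installed; the point is only the pattern of read-only configuration on the instance and per-call state in locals.

```python
from concurrent.futures import ThreadPoolExecutor


class LengthEvaluator:
    """Hypothetical stand-in for a BaseEvaluator subclass."""

    def __init__(self, max_words: int = 30):
        self.max_words = max_words  # set once, treated as read-only afterwards

    def evaluate(self, output: str) -> bool:
        # Per-call state lives in a local, so concurrent calls cannot interfere.
        words = len(output.split())
        return words <= self.max_words


checker = LengthEvaluator(max_words=3)
outputs = ["one two", "one two three four", "one"]

# One shared evaluator instance, called from several threads at once.
with ThreadPoolExecutor(max_workers=3) as pool:
    verdicts = list(pool.map(checker.evaluate, outputs))
```

If `evaluate` instead stored the word count on `self`, two overlapping calls could overwrite each other's value before the comparison ran, which is the failure mode the note warns about.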

Next steps

Evaluate with an LLM-as-a-judge when a rule isn't enough and you need a model's judgment.
Server-side evaluations to run checks on live traffic from the UI.