Write a custom evaluator
A custom evaluator is plain Python that scores a response. No model, no judge prompt: you write a function, return a value, and Middleware attaches the result to the trace span.
Use one whenever correctness is objective and you can check it in code:
- Non-empty or well-formed output (valid JSON, required keys)
- Regex or exact match against an expected answer
- Length or format constraints
- Similarity to a reference above a threshold
These checks run faster than an LLM-as-a-judge, cost nothing per evaluation, and don't drift. Reach for a judge only when the check needs actual judgment.
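As a rough illustration, each kind of check listed above fits in a few lines of standard-library Python. The helper names below (has_required_keys, matches_pattern, within_length, similar_enough) are hypothetical and not part of Middleware; they only show the kind of logic a custom evaluator might wrap.

```python
import json
import re
from difflib import SequenceMatcher


def has_required_keys(output: str, required: set[str]) -> bool:
    """True if output parses as a JSON object containing every required key."""
    try:
        data = json.loads(output)
    except (json.JSONDecodeError, TypeError):
        return False
    return isinstance(data, dict) and required.issubset(data)


def matches_pattern(output: str, pattern: str) -> bool:
    """True if the whole (stripped) output matches the regular expression."""
    return re.fullmatch(pattern, output.strip()) is not None


def within_length(output: str, max_chars: int) -> bool:
    """True if the output is at most max_chars characters long."""
    return len(output) <= max_chars


def similar_enough(output: str, reference: str, threshold: float = 0.8) -> bool:
    """True if the character-level similarity ratio reaches the threshold."""
    return SequenceMatcher(None, output, reference).ratio() >= threshold
```

Every helper is deterministic: the same output always produces the same verdict, which is the property that makes these checks stable over time.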
How it works
You pass the evaluator an EvaluatorContext (the input, the output, and any expected output). It returns a value, and evaluate_and_submit records the result on the active span:
```
EvaluatorContext (input, output, expected_output)
        │
        ▼
your evaluator function ──► value + optional pass/fail
        │
        ▼
attached to the active trace span in Middleware
```

The return type sets the metric type automatically:
| Return | Metric type |
|---|---|
| bool | boolean (pass/fail) |
| int / float | score |
| str | categorical |
| None | skipped |
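The mapping in the table can be sketched as a small dispatch on the Python type. This is only an illustration of the table, not Middleware's implementation, and metric_type_for is a hypothetical name. One detail worth noticing is that bool is a subclass of int in Python, so any such dispatch has to test for bool first.

```python
from typing import Optional, Union

Value = Union[bool, int, float, str, None]


def metric_type_for(value: Value) -> Optional[str]:
    """Illustrative mapping only; it mirrors the table, not the library's code."""
    if value is None:
        return None  # skipped: nothing is recorded
    # bool must be tested before int, because bool is a subclass of int.
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, (int, float)):
        return "score"
    if isinstance(value, str):
        return "categorical"
    raise TypeError(f"unsupported evaluator return type: {type(value).__name__}")
```

In practice this means returning True and returning 1 are not interchangeable: according to the table, the first is recorded as a pass/fail metric and the second as a score.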
What you'll need

```bash
pip install middleware-llmobs openinference-instrumentation-openai openai
```

Export your Middleware endpoint, key, and OPENAI_API_KEY as in Trace an LLM application.
A quick function check
Use the @evaluator decorator for a one-off check. It captures exceptions into the result and normalizes the return value, so a thrown error becomes a failed eval instead of crashing your app.
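To make the exception-capturing behavior concrete, here is a minimal sketch of a decorator with that general shape. It is not Middleware's implementation: capture_errors, the dict result, and its keys are all assumptions made for illustration.

```python
import functools
from typing import Any, Callable


def capture_errors(func: Callable[..., Any]) -> Callable[..., dict]:
    """Hypothetical wrapper: turn a raised exception into a failed result."""

    @functools.wraps(func)
    def wrapper(*args: Any, **kwargs: Any) -> dict:
        try:
            value = func(*args, **kwargs)
        except Exception as exc:  # deliberately broad: any error fails the eval
            return {"value": None, "assessment": "fail", "error": repr(exc)}
        return {"value": value, "assessment": None, "error": None}

    return wrapper


@capture_errors
def starts_with_capital(output: str) -> bool:
    return output[0].isupper()  # raises IndexError on an empty string


ok = starts_with_capital("Hello")
failed = starts_with_capital("")
```

The second call returns a failed result that carries the error text instead of raising, so the calling code keeps running.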
The following example asks the model for a short summary, then checks that the actual response stays within a word budget, a deterministic rule for which an LLM judge would be overkill. The model call and the check both run inside the summarize chain span; evaluate_and_submit binds to the active span, so the result attaches to that chain span:
```python
from openai import OpenAI
from middleware.llmobs import (
    register, evaluator, EvaluatorContext, EvaluatorResult,
    evaluate_and_submit, flush_evaluations,
)

providers = register(service_name="eval-example", auto_instrument=True)
tracer = providers.tracer.get_tracer(__name__)
client = OpenAI()

MAX_WORDS = 30

@evaluator(name="within_word_limit")
def within_word_limit(ctx: EvaluatorContext) -> EvaluatorResult:
    """Pass if the response is at most MAX_WORDS words."""
    words = len(ctx.output.split())
    return EvaluatorResult(
        value=words <= MAX_WORDS,
        assessment="pass" if words <= MAX_WORDS else "fail",
        reasoning=f"{words} words (limit {MAX_WORDS}).",
    )

@tracer.chain
def summarize(text: str) -> str:
    answer = client.chat.completions.create(  # auto-instrumented LLM span, nested here
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": f"Summarize the text in {MAX_WORDS} words or fewer."},
            {"role": "user", "content": text},
        ],
    ).choices[0].message.content

    # Active span is the chain span, so the eval attaches to it.
    evaluate_and_submit(within_word_limit, EvaluatorContext(input=text, output=answer))
    return answer

summarize("OpenTelemetry is an observability framework for generating, collecting, and exporting telemetry data such as traces, metrics, and logs.")
flush_evaluations()
```

A stateful evaluator with a threshold
When an evaluator needs configuration or state (a threshold, a client, a compiled pattern), subclass BaseEvaluator and implement evaluate(). Return an EvaluatorResult to set the value and a pass/fail assessment:
```python
from difflib import SequenceMatcher
from middleware.llmobs import BaseEvaluator, EvaluatorResult, EvaluatorContext, evaluate_and_submit

class SimilarityEvaluator(BaseEvaluator):
    def __init__(self, threshold=0.8):
        super().__init__(name="similarity")
        self.threshold = threshold

    def evaluate(self, ctx: EvaluatorContext):
        score = SequenceMatcher(None, ctx.output, ctx.expected_output).ratio()
        return EvaluatorResult(
            value=round(score, 2),
            assessment="pass" if score >= self.threshold else "fail",
            reasoning=f"Similarity {score:.2f} vs threshold {self.threshold}.",
        )

similarity = SimilarityEvaluator(threshold=0.85)
evaluate_and_submit(
    similarity,
    EvaluatorContext(output="Paris", expected_output="Paris"),
)
```

What you get
The evaluation attaches to the summarize span and shows on its Evaluation tab in LLM Traces, with the value and pass/fail verdict.

It's also exported as metrics (gen_ai.evaluations.*), so a custom check shows up on dashboards exactly like a server-side or LLM-judge eval.
evaluate() may run concurrently across threads or tasks, so don't mutate instance attributes inside it; use locals. For async def evaluators, use @async_evaluator (or AsyncBaseEvaluator) with await aevaluate_and_submit(...).
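The concurrency advice can be illustrated without any Middleware imports. In this sketch, PatternCheck is a hypothetical stand-in for an evaluator class: configuration is set once in __init__ and only read afterwards, while everything computed per call stays in local variables.

```python
import re
from concurrent.futures import ThreadPoolExecutor


class PatternCheck:
    """Hypothetical stand-in for an evaluator class; not a Middleware type."""

    def __init__(self, pattern: str):
        # Configuration is set once here and only read afterwards.
        self.pattern = re.compile(pattern)

    def evaluate(self, output: str) -> bool:
        # Per-call data lives in locals, so concurrent calls cannot interfere.
        match = self.pattern.search(output)
        return match is not None


check = PatternCheck(r"\b\d{4}-\d{2}-\d{2}\b")
outputs = ["Released 2024-05-01.", "No date here.", "Due 2023-12-31"] * 50

# Run the same evaluator instance from several threads at once.
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(check.evaluate, outputs))
```

Had evaluate() stored the match on self instead, two threads could overwrite each other's value between the search and the return, and the verdicts would no longer be reliable.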
Next steps
- Evaluate with an LLM-as-a-judge when a rule isn't enough and you need a model's judgment.
- Server-side evaluations to run checks on live traffic from the UI.