Client-side Evaluations (SDK)
Client-side evaluations run in your application code using the middleware-llmobs SDK. You define how responses are evaluated, whether by submitting a computed value, running your own LLM-as-a-judge, or executing a deterministic check. The SDK then attaches the result to the trace span and exports it to Middleware via OTLP.
Use client-side evaluations when you need logic that lives in your app: RAG-aware checks, offline or CI evaluation, deterministic checks (regex, JSON validity, similarity thresholds), or a judge you fully control.
For no-code, always-on evaluation of production traffic, use server-side evaluations instead. Middleware runs the judge for you.
Score value types
An evaluation result can use one of three value types, the same types used everywhere in Middleware evaluations:
- boolean — a pass/fail result (e.g. a content-safety check).
- score — a number, with a passing range you define.
- categorical — a label from a defined set.
Each can carry an optional pass/fail assessment and the reasoning behind it.
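The mapping from a Python value to one of these three types can be sketched as follows. This is a minimal illustration and not the SDK's implementation: `infer_value_type` and `in_passing_range` are hypothetical helpers, and the bool/number/string mapping is taken from the comment in the `submit_evaluation` example on this page.

```python
def infer_value_type(value):
    """Hypothetical helper: classify a result value as one of the three types."""
    # bool must be tested before int, because bool is a subclass of int in Python.
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, (int, float)):
        return "score"
    if isinstance(value, str):
        return "categorical"
    raise TypeError(f"unsupported evaluation value: {type(value).__name__}")


def in_passing_range(score, low, high):
    """Hypothetical helper: derive a pass/fail assessment from a score range."""
    return "pass" if low <= score <= high else "fail"
```

The ordering of the checks matters: testing for `int` first would classify `True` as a score rather than a boolean.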
Three ways to evaluate
1. Submit a precomputed result
If your application has already computed an evaluation result, record it with submit_evaluation. By default it binds to the active span:

    from middleware.llmobs import submit_evaluation

    submit_evaluation(
        label="response_not_empty",  # ^[a-zA-Z][a-zA-Z0-9_]*$
        value=True,                  # bool -> boolean, int/float -> score, str -> categorical
        assessment="pass",           # optional: "pass" | "fail"
        reasoning="The model returned a non-empty answer.",
    )

2. Run an LLM-as-a-judge in code
Use LLMJudge to evaluate model outputs with another model and submit the result in a single step with evaluate_and_submit. Because the SDK never imports a provider, you supply a thin client adapter for your model.
→ Full recipe: Evaluate with an LLM-as-judge
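To illustrate what a thin client adapter might look like, here is a provider-agnostic sketch. It deliberately does not use `LLMJudge` or `evaluate_and_submit`, whose signatures are documented in the recipe above. Every name in it (`JudgeClient`, `FakeClient`, `judge`, the prompt, and the JSON verdict format) is a hypothetical stand-in chosen for illustration.

```python
import json
from typing import Protocol


class JudgeClient(Protocol):
    """Hypothetical adapter interface: one method that returns the model's text."""

    def complete(self, prompt: str) -> str: ...


class FakeClient:
    """Stand-in for a real provider client, so the sketch runs offline."""

    def complete(self, prompt: str) -> str:
        return '{"value": true, "reasoning": "The answer addresses the question."}'


def judge(client: JudgeClient, question: str, answer: str) -> dict:
    """Ask the judge model for a JSON verdict and normalize it."""
    prompt = (
        "Does the answer address the question? "
        'Reply with JSON: {"value": true|false, "reasoning": "..."}\n'
        f"Question: {question}\nAnswer: {answer}"
    )
    verdict = json.loads(client.complete(prompt))
    passed = bool(verdict["value"])
    return {
        "value": passed,
        "assessment": "pass" if passed else "fail",
        "reasoning": str(verdict.get("reasoning", "")),
    }


result = judge(FakeClient(), "What is OTLP?", "OTLP is the OpenTelemetry Protocol.")
```

Keeping the provider behind a single-method interface means the judging logic can be tested with a fake client, and swapping models only touches the adapter. The keys of the returned dictionary mirror the `value`, `assessment`, and `reasoning` parameters shown for `submit_evaluation`.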
3. Write a custom (non-LLM) evaluator
For deterministic checks — regex, JSON validity, length, similarity thresholds — use the @evaluator decorator for quick functions, or subclass BaseEvaluator for stateful/configurable evaluators.
→ Full recipe: Write a custom evaluator
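As a rough idea of the logic such an evaluator would wrap, the sketch below implements two deterministic checks as plain functions. It does not use `@evaluator` or `BaseEvaluator`, whose exact signatures are covered in the recipe above. The function names, the `ORDER_ID` pattern, and the returned dictionary shape are assumptions made for illustration.

```python
import json
import re

# Hypothetical pattern: an order ID such as "ORD-123456".
ORDER_ID = re.compile(r"\bORD-\d{6}\b")


def check_json_validity(output: str) -> dict:
    """Pass if the model output parses as JSON."""
    try:
        json.loads(output)
    except json.JSONDecodeError as exc:
        return {
            "value": False,
            "assessment": "fail",
            "reasoning": f"Invalid JSON: {exc.msg}",
        }
    return {"value": True, "assessment": "pass", "reasoning": "Output parsed as JSON."}


def check_contains_order_id(output: str) -> dict:
    """Pass if the model output mentions a well-formed order ID."""
    found = ORDER_ID.search(output) is not None
    return {
        "value": found,
        "assessment": "pass" if found else "fail",
        "reasoning": "Order ID found." if found else "No order ID in output.",
    }
```

Because these checks have no model in the loop, they return the same result for the same input, which makes them suitable for CI runs.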
Full reference
The SDK guide contains the complete client-side evaluation API, including:
- submit_evaluation
- LLMJudge and AsyncLLMJudge
- Structured output types
- @evaluator and BaseEvaluator
- EvaluatorContext and EvaluatorResult
- Flushing support for short-lived processes
→ Middleware SDK for Python — Evaluations
Where results appear
Submitted evaluations are attached to the span they evaluate and appear in the span’s Evaluation tab in LLM Traces, just like server-side evaluation results.
The SDK exports evaluation results using the same schema as server-side evaluations, following the OpenTelemetry GenAI semantic conventions. Each evaluation is exported as:
- An OTLP log record containing the result and gen_ai.evaluation.* attributes, correlated to the evaluated span through trace_id and span_id.
- Gauge metrics under gen_ai.evaluations.* for dashboards, pass-rate tracking, and alerting.
Evaluation data is exported over OTLP/HTTP. For the exact attributes and metric names, see Data exported on the server-side page.