Client-side Evaluations (SDK)

Client-side evaluations run in your application code using the middleware-llmobs SDK. You define how responses are evaluated, whether by submitting a computed value, running your own LLM-as-a-judge, or executing a deterministic check. The SDK then attaches the result to the trace span and exports it to Middleware via OTLP.

Use client-side evaluations when you need logic that lives in your app: RAG-aware checks, offline or CI evaluation, deterministic checks (regex, JSON validity, similarity thresholds), or a judge you fully control.

For no-code, always-on evaluation of production traffic, use server-side evaluations instead. Middleware runs the judge for you.

Score value types

An evaluation result can use one of three value types, the same types used everywhere in Middleware evaluations:

boolean — a pass/fail result (e.g. a content-safety check).
score — a number, with a passing range you define.
categorical — a label from a defined set.

Each can carry an optional pass/fail assessment and the reasoning behind it.
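As a rough illustration of how the three value types relate to ordinary Python values, the sketch below maps a value to its type and checks a score against a passing range. The helper names `infer_value_type` and `assess_score` are hypothetical and not part of the SDK, and the inclusive range is an assumption made for the example.

```python
from typing import Tuple, Union

Value = Union[bool, int, float, str]


def infer_value_type(value: Value) -> str:
    """Map a Python value to one of the three evaluation value types."""
    # bool must be tested first: in Python, bool is a subclass of int.
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, (int, float)):
        return "score"
    if isinstance(value, str):
        return "categorical"
    raise TypeError(f"unsupported evaluation value: {type(value).__name__}")


def assess_score(score: float, passing_range: Tuple[float, float]) -> str:
    """Return "pass" when the score lies inside the passing range.

    Assumption: both ends of the range are inclusive.
    """
    low, high = passing_range
    return "pass" if low <= score <= high else "fail"
```

For example, `infer_value_type(0.8)` yields `"score"`, and `assess_score(0.8, (0.7, 1.0))` yields `"pass"` under the inclusive-range assumption.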

Three ways to evaluate

1. Submit a precomputed result

If your application has already computed an evaluation result, record it with submit_evaluation. By default it binds to the active span:

from middleware.llmobs import submit_evaluation

submit_evaluation(
    label="response_not_empty",   # ^[a-zA-Z][a-zA-Z0-9_]*$
    value=True,                   # bool -> boolean, int/float -> score, str -> categorical
    assessment="pass",            # optional: "pass" | "fail"
    reasoning="The model returned a non-empty answer.",
)
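Because labels must match the pattern shown in the comment above, it can help to validate them before submitting. The following is a minimal sketch using only the standard library; `is_valid_label` is a hypothetical helper, not an SDK function, and how the SDK itself reacts to an invalid label is not covered here.

```python
import re

# Same character rules as the pattern in the label comment above; fullmatch
# supplies the anchors and also rejects a trailing newline.
LABEL_PATTERN = re.compile(r"[a-zA-Z][a-zA-Z0-9_]*")


def is_valid_label(label: str) -> bool:
    """Hypothetical helper: True when the whole label matches the pattern."""
    return LABEL_PATTERN.fullmatch(label) is not None
```

With this check, `response_not_empty` is accepted, while labels that start with a digit or contain a hyphen are rejected.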

2. Run an LLM-as-a-judge in code

Use LLMJudge to evaluate model outputs with another model and submit the result in a single step with evaluate_and_submit. Because the SDK never imports a provider, you supply a thin client adapter for your model.

→ Full recipe: Evaluate with an LLM-as-judge
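To make the idea of a thin client adapter concrete, here is a self-contained sketch of the general pattern. It does not use the SDK: the `JudgeClient` callable shape, the prompt wording, the JSON reply format, and the 0.7 threshold are all assumptions for illustration, and the interface that LLMJudge actually expects is described in the recipe linked above.

```python
import json
from typing import Callable, Dict

# Hypothetical adapter shape: a callable that takes a prompt and returns the
# judge model's raw text. The interface LLMJudge actually expects may differ.
JudgeClient = Callable[[str], str]


def build_prompt(question: str, answer: str) -> str:
    """Assemble an illustrative judge prompt that asks for a JSON verdict."""
    return (
        "Rate whether the answer addresses the question.\n"
        f"Question: {question}\n"
        f"Answer: {answer}\n"
        'Reply with JSON: {"score": <0.0-1.0>, "reasoning": "<one sentence>"}'
    )


def parse_verdict(raw: str, threshold: float = 0.7) -> Dict[str, object]:
    """Turn the judge's JSON reply into value, assessment, and reasoning."""
    data = json.loads(raw)
    score = float(data["score"])
    return {
        "value": score,
        "assessment": "pass" if score >= threshold else "fail",
        "reasoning": str(data.get("reasoning", "")),
    }


def fake_client(prompt: str) -> str:
    """Stand-in for a real provider call, so the sketch runs offline."""
    return '{"score": 0.9, "reasoning": "The answer is on topic."}'


def judge(client: JudgeClient, question: str, answer: str) -> Dict[str, object]:
    """Send the prompt through the adapter and parse the reply."""
    return parse_verdict(client(build_prompt(question, answer)))
```

In a real application, `fake_client` would be replaced by a function that calls your model provider and returns its text response. The rest of the code stays provider-agnostic, which is the point of keeping the adapter thin.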

3. Write a custom (non-LLM) evaluator

For deterministic checks — regex, JSON validity, length, similarity thresholds — use the @evaluator decorator for quick functions, or subclass BaseEvaluator for stateful/configurable evaluators.

→ Full recipe: Write a custom evaluator
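The deterministic checks themselves are ordinary Python. The sketch below shows a JSON validity check and a similarity threshold check as plain functions that return a value, an assessment, and reasoning. The function names, the result dictionary shape, and the 0.8 threshold are assumptions for illustration; wiring such functions into @evaluator or BaseEvaluator follows the recipe linked above.

```python
import json
from difflib import SequenceMatcher
from typing import Dict


def check_valid_json(output: str) -> Dict[str, object]:
    """Boolean check: does the model output parse as JSON?"""
    try:
        json.loads(output)
    except json.JSONDecodeError as exc:
        return {"value": False, "assessment": "fail", "reasoning": f"Invalid JSON: {exc.msg}"}
    return {"value": True, "assessment": "pass", "reasoning": "Output parsed as JSON."}


def check_similarity(output: str, expected: str, threshold: float = 0.8) -> Dict[str, object]:
    """Score check: character-level similarity against a reference answer."""
    # SequenceMatcher.ratio() returns a float in [0, 1]; 1.0 means identical.
    ratio = SequenceMatcher(None, output, expected).ratio()
    return {
        "value": ratio,
        "assessment": "pass" if ratio >= threshold else "fail",
        "reasoning": f"Similarity {ratio:.2f} against threshold {threshold:.2f}.",
    }
```

Keeping each check a pure function makes it easy to unit test in CI before it is attached to any span.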

Full reference

The SDK guide contains the complete client-side evaluation API, including:

submit_evaluation
LLMJudge and AsyncLLMJudge
Structured output types
@evaluator and BaseEvaluator
EvaluatorContext and EvaluatorResult
Flushing support for short-lived processes

→ Middleware SDK for Python — Evaluations

Where results appear

Submitted evaluations are attached to the span they evaluate and appear in the span’s Evaluation tab in LLM Traces, just like server-side evaluation results.

The SDK exports evaluation results using the same schema as server-side evaluations, following the OpenTelemetry GenAI semantic conventions. Each evaluation is exported as:

An OTLP log record containing the result and gen_ai.evaluation.* attributes, correlated to the evaluated span through trace_id and span_id.
Gauge metrics under gen_ai.evaluations.* for dashboards, pass-rate tracking, and alerting.

Evaluation data is exported over OTLP/HTTP. For the exact attribute and metric names, see the "Data exported" section of the server-side evaluations page.
