Server-side Evaluations (UI)

Server-side evaluations run LLM-as-a-judge scoring within Middleware on live LLM trace spans. After you create and publish an evaluator, Middleware automatically evaluates matching spans and attaches the result, including the verdict, score, and reasoning, to each span. No code or SDK changes are required.

Use server-side evaluations for continuous quality and safety monitoring of production traffic, such as detecting toxic responses, prompt-injection attempts, off-topic answers, and unresolved user goals.

Server-side evaluations process spans as they are ingested into Middleware. They do not run against datasets or offline experiments. For offline/CI checks, deterministic (non-LLM) checks, or custom evaluation logic, use client-side evaluations with the SDK.

Before you begin#

LLM traces flowing into Middleware (see LLM Observability Overview).

You don’t need to bring your own LLM API key. Middleware runs the judge on its own OpenAI and Anthropic accounts, so you just pick a model. Judge tokens are billed against your account and counted in the Tokens section, with 2 million tokens included by default.

Add an evaluator#

Open LLM Traces and go to the Evaluation tab.
Click Add Evaluator.
Pick a built-in template for a common check, or choose LLM-as-a-Judge to write your own from scratch.

Built-in templates#

Each template is a ready-made LLM-as-a-judge with a tuned prompt, an output type, and pass/fail criteria. You pick the judge model when you create the evaluator, and you can edit the prompt and criteria too.

Template	Output	What it checks
Toxicity	categorical	Flags harassment, hate, discriminatory, sexual, violent, or otherwise toxic content. Passes when the content is classified as `Not Toxic`.
Prompt Injection	categorical	Detects prompt-injection attempts in the span input. Passes when the result is `Not Prompt Injection`.
Topic Relevancy	categorical	Whether the message is on-topic for your application. Passes when the result is `ON_TOPIC` or `NEUTRAL`.
Sentiment Analysis	categorical	Labels sentiment as Positive, Neutral, or Negative. Passes when the result is `Positive` or `Neutral`.
Failure Answer	categorical	Classifies empty, refusal, redirection, and other failure-to-answer patterns.
Failure Answer (Boolean)	boolean	Whether the assistant gave a substantive answer.
Goal Completeness	boolean	Whether every user intention in the conversation was resolved.
Tool Selection Relevance	boolean	Whether the agent selected tools relevant to the user’s intent.
Tool Argument Correctness	boolean	Whether tools were called with correct arguments matching their schema.

Create a custom LLM-as-a-judge#

Choose LLM-as-a-Judge to define your own evaluator. The configuration form has these sections.

Evaluator configuration showing details, scope, and prompt sections

Details — a unique name for the evaluator (used as the score’s name and feedback key), and the judge model to run it with.
Scope — which spans to evaluate:
- Service name — limit the evaluator to one or more services. Leave it empty to match all services.
- Sampling rate — the percentage of matching spans to evaluate (0–100). Lower it on high-volume applications to control judge cost.
- Evaluate On — filter rules that decide which spans qualify. Each rule matches a span attribute against a value, for example gen_ai.operation.name Equal to chat to score only chat completions. Click Add Rule to combine conditions. With no rules, every span in scope is evaluated.
To evaluate at the trace level (root spans only), add a rule on parent_span_id and leave its value empty, so the evaluator matches only spans with no parent.

Prompt — the judge instructions, split into a System prompt (the judge’s role and rubric) and a User prompt (what to evaluate). Reference the evaluated span with template variables:

Variable	Resolves to
`{{span.input}}`	The user/input message of the span
`{{span.output}}`	The assistant/output message of the span
`{{span.messages}}`	The full input messages
`{{span.system_prompt}}`	The span’s system instructions
`{{span.model}}`	The model the span used
`{{span.provider}}`	The provider the span used

A typical user prompt is:

1Span Input: {{span.input}}
2Span Output: {{span.output}}

You can also map a prompt variable to any span attribute by key. This is a direct attribute mapping, not a JSONPath expression.

Output and acceptance criteria#

Evaluator configuration showing structured output response format and acceptance criteria

Structured Output — the Response Format the judge must return. Pick one and edit the JSON schema:
- Category — one label from a defined set (for example, ON_TOPIC / OFF_TOPIC / NEUTRAL).
- Boolean — a true/false result.
- Score — a number within a defined range.
Include a reasoning field in the schema so the judge's reasoning is stored alongside the result.
Acceptance Criteria — how a result maps to Pass or Fail:
- For Category, list the passing categories (pass_values).
- For Boolean, choose whether true or false is the passing value (pass_when).
- For Score, set the passing range with a minimum and/or maximum threshold.

When you’re finished, click Save and Publish. Middleware starts running the evaluator on new matching spans. (You can also save it as a draft and publish later.)

Supported judge models#

The judge can be any of these models, all run on Middleware’s own provider accounts:

OpenAI — gpt-4o, gpt-4o-mini, gpt-4-turbo, gpt-4.1, gpt-4.1-mini, o1, o1-mini, o3, o3-mini, o4-mini.
Anthropic — claude-opus-4-6, claude-opus-4-5, claude-sonnet-4-6, claude-sonnet-4-5, claude-haiku-4-5, claude-3-5-sonnet-latest, claude-3-5-haiku-latest.

View results#

Once an evaluator is published, results attach to the spans it scores. Open a trace in LLM Traces, select a span, and open the Evaluation tab to see the verdict, value, and reasoning.

Span Evaluation tab showing an evaluation result with its assessment and the judge's reasoning

Raw evaluation attributes recorded on a span, including name, score value, verdict, and reasoning

Data exported#

Every evaluation is exported to Middleware in two forms, both following the OpenTelemetry GenAI semantic conventions (the gen_ai.* namespace). Client-side SDK evaluations export the same way, which is why both look identical in the UI.

Log record#

Each result is emitted as one OTLP log record, correlated to the evaluated span by trace_id and span_id. The log body is a JSON object with the full result (eval_name, score_value, score_label, verdict, explanation, judge_provider, judge_model, and judge token usage/cost_usd when available). The same fields are also set as log attributes so you can search and filter on them:

Attribute	Meaning
`gen_ai.evaluation.name`	The evaluator’s name
`gen_ai.evaluation.score.value`	The numeric score value
`gen_ai.evaluation.score.label`	The label or category
`gen_ai.evaluation.verdict`	`pass` or `fail`
`gen_ai.evaluation.explanation`	The judge’s reasoning
`gen_ai.evaluation.cost.usd`	Estimated judge cost, when token usage is known
`eval.target.span_id` / `eval.target.trace_id`	The span and trace that were evaluated
`eval.model.provider` / `eval.model.name`	The judge model used

Metrics#

Each result also produces these gauge metrics, so you can chart pass rates over time and alert when quality drops:

Metric	Unit	Meaning
`gen_ai.evaluations.count`	`1`	One datapoint per evaluation run
`gen_ai.evaluations.score`	`1`	The numeric score. For boolean evals, `1` when the raw result is true; for categorical, `1` when the result passes the acceptance criteria, else `0`; for score evals, the raw number
`gen_ai.evaluations.outcome`	`1`	One datapoint per evaluation, tagged with an `outcome` attribute of `pass`, `fail`, or `error`
`gen_ai.evaluations.cost.usd`	`USD`	Estimated judge cost

Each metric carries name, label, verdict, service, model, and provider attributes for grouping.

Troubleshooting#

No results on spans — Confirm the evaluator is published (not a draft), that its service name and Evaluate On rules actually match your spans (an overly strict rule scopes everything out), and that the sampling rate isn’t so low that few spans are picked. Sampling below 100% means not every span gets a result.
Judge errors / no score — Confirm the selected judge model is one of the supported OpenAI/Anthropic models listed above. Also check the Tokens section: if your account has used its included token allowance, evaluations stop running until it’s topped up.
Empty input/output in the prompt — The judge reads {{span.input}} / {{span.output}} from the span’s GenAI attributes. Spans without GenAI input/output messages won’t give the judge anything to evaluate.

Need help? Contact the Middleware support team at [email protected] or join our Slack community.