Server-side Evaluations (UI)

Server-side evaluations run LLM-as-a-judge scoring within Middleware on live LLM trace spans. After you create and publish an evaluator, Middleware automatically evaluates matching spans and attaches the result, including the verdict, score, and reasoning, to each span. No code or SDK changes are required.

Use server-side evaluations for continuous quality and safety monitoring of production traffic, such as detecting toxic responses, prompt-injection attempts, off-topic answers, and unresolved user goals.

Server-side evaluations process spans as they are ingested into Middleware. They do not run against datasets or offline experiments. For offline/CI checks, deterministic (non-LLM) checks, or custom evaluation logic, use client-side evaluations with the SDK.

Before you begin#

You don’t need to bring your own LLM API key. Middleware runs the judge on its own OpenAI and Anthropic accounts, so you just pick a model. Judge tokens are billed against your account and counted in the Tokens section, with 2 million tokens included by default.
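
As a rough illustration of how the included allowance relates to traffic volume, the sketch below estimates judge token consumption and the highest sampling rate (the Sampling rate setting described under Scope) that stays within a budget. The span count and the tokens-per-evaluation figure are hypothetical assumptions; only the 2 million token default comes from this page, and actual usage depends on your prompts and judge model.

```python
# Rough judge-token budget estimate. Every number here is an illustrative
# assumption except the 2,000,000-token default allowance from this page.
INCLUDED_TOKENS = 2_000_000


def estimated_judge_tokens(spans: int, sampling_rate_pct: float,
                           tokens_per_eval: int) -> int:
    """Tokens consumed if `sampling_rate_pct` percent of `spans` are judged."""
    evaluated = spans * sampling_rate_pct / 100
    return round(evaluated * tokens_per_eval)


def max_sampling_rate_pct(spans: int, tokens_per_eval: int,
                          budget: int = INCLUDED_TOKENS) -> float:
    """Largest sampling rate (capped at 100) that stays within `budget`."""
    if spans == 0 or tokens_per_eval == 0:
        return 100.0
    return min(100.0, 100 * budget / (spans * tokens_per_eval))


# Hypothetical workload: 50,000 matching spans, about 800 judge tokens each.
print(estimated_judge_tokens(50_000, 100, 800))  # 40000000
print(max_sampling_rate_pct(50_000, 800))        # 5.0
```

With these assumed numbers, evaluating every span would use 40 million tokens, so a sampling rate of about 5% would keep the evaluator within the default allowance.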

Add an evaluator#

  1. Open LLM Traces and go to the Evaluation tab.
  2. Click Add Evaluator.
  3. Pick a built-in template for a common check, or choose LLM-as-a-Judge to write your own from scratch.
[Screenshot: Add Evaluator template gallery with built-in templates and a create-from-scratch LLM-as-a-Judge option]

Built-in templates#

Each template is a ready-made LLM-as-a-judge with a tuned prompt, an output type, and pass/fail criteria. You pick the judge model when you create the evaluator, and you can edit the prompt and criteria too.

| Template | Output | What it checks |
| --- | --- | --- |
| Toxicity | categorical | Flags harassment, hate, discriminatory, sexual, violent, or otherwise toxic content. Passes when the content is classified as Not Toxic. |
| Prompt Injection | categorical | Detects prompt-injection attempts in the span input. Passes when the result is Not Prompt Injection. |
| Topic Relevancy | categorical | Whether the message is on-topic for your application. Passes when the result is ON_TOPIC or NEUTRAL. |
| Sentiment Analysis | categorical | Labels sentiment as Positive, Neutral, or Negative. Passes when the result is Positive or Neutral. |
| Failure Answer | categorical | Classifies empty, refusal, redirection, and other failure-to-answer patterns. |
| Failure Answer (Boolean) | boolean | Whether the assistant gave a substantive answer. |
| Goal Completeness | boolean | Whether every user intention in the conversation was resolved. |
| Tool Selection Relevance | boolean | Whether the agent selected tools relevant to the user’s intent. |
| Tool Argument Correctness | boolean | Whether tools were called with correct arguments matching their schema. |

Create a custom LLM-as-a-judge#

Choose LLM-as-a-Judge to define your own evaluator. The configuration form has these sections.

[Screenshot: Evaluator configuration showing details, scope, and prompt sections]
  • Details — a unique name for the evaluator (used as the score’s name and feedback key), and the judge model to run it with.

  • Scope — which spans to evaluate:

    • Service name — limit the evaluator to one or more services. Leave it empty to match all services.
    • Sampling rate — the percentage of matching spans to evaluate (0–100). Lower it on high-volume applications to control judge cost.
    • Evaluate On — filter rules that decide which spans qualify. Each rule matches a span attribute against a value, for example gen_ai.operation.name Equal to chat to score only chat completions. Click Add Rule to combine conditions. With no rules, every span in scope is evaluated.

    To evaluate at the trace level (root spans only), add a rule on parent_span_id and leave its value empty, so the evaluator matches only spans with no parent.

    [Screenshot: Evaluator Scope section with a service name field, sampling rate, and an Evaluate On filter rule matching gen_ai.operation.name equal to chat]
  • Prompt — the judge instructions, split into a System prompt (the judge’s role and rubric) and a User prompt (what to evaluate). Reference the evaluated span with template variables:

    | Variable | Resolves to |
    | --- | --- |
    | {{span.input}} | The user/input message of the span |
    | {{span.output}} | The assistant/output message of the span |
    | {{span.messages}} | The full input messages |
    | {{span.system_prompt}} | The span’s system instructions |
    | {{span.model}} | The model the span used |
    | {{span.provider}} | The provider the span used |

A typical user prompt is:

    Span Input: {{span.input}}
    Span Output: {{span.output}}
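
To make the substitution concrete, here is a minimal sketch of how such a user prompt could be filled in from a span. The `render` helper, the regular expression, and the sample span are hypothetical illustrations, not Middleware's implementation. Rendering a missing value as an empty string is also an assumption, chosen to mirror the "empty input/output" case described under Troubleshooting.

```python
import re

# Hypothetical span data; the keys mirror the template variables above.
span = {
    "input": "How do I reset my password?",
    "output": "Go to Settings > Security and click Reset password.",
    "model": "gpt-4o-mini",
}

USER_PROMPT = "Span Input: {{span.input}}\nSpan Output: {{span.output}}"

# Matches {{span.<name>}}, allowing optional spaces inside the braces.
VARIABLE = re.compile(r"\{\{\s*span\.(\w+)\s*\}\}")


def render(template: str, span: dict) -> str:
    """Replace each {{span.<name>}} with the span's value, or "" if absent."""
    return VARIABLE.sub(lambda m: str(span.get(m.group(1), "")), template)


print(render(USER_PROMPT, span))
```

Previewing a prompt this way against a few representative spans can help spot variables that resolve to nothing before you publish the evaluator.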

You can also map a prompt variable to any span attribute by key. This is a direct attribute mapping, not a JSONPath expression.
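
The Scope settings described above (service name, Evaluate On rules, sampling rate) can be pictured as three successive filters. The sketch below is an illustration under stated assumptions, not Middleware's code: it assumes rules are combined with AND, models only the Equal to operator, treats a missing attribute as an empty value, and reads the service from a `service.name` attribute. The function names and sample spans are hypothetical.

```python
import random


def matches_rules(span_attrs: dict, rules: list) -> bool:
    """True when every (attribute, expected value) rule matches.

    Assumption: rules are ANDed, and a missing attribute counts as "".
    An empty rule list matches every span.
    """
    return all(str(span_attrs.get(key, "")) == expected
               for key, expected in rules)


def in_scope(span_attrs: dict, services: list, rules: list,
             sampling_rate_pct: float, rng=random.random) -> bool:
    """Decide whether one span would be evaluated."""
    # An empty service list matches all services.
    if services and span_attrs.get("service.name") not in services:
        return False
    if not matches_rules(span_attrs, rules):
        return False
    # Sample the given percentage (0-100) of the matching spans.
    return rng() * 100 < sampling_rate_pct


# Hypothetical spans: a child chat span and a root span with no parent.
chat_span = {"service.name": "support-bot",
             "gen_ai.operation.name": "chat",
             "parent_span_id": "a1b2"}
root_span = {"service.name": "support-bot",
             "gen_ai.operation.name": "chat"}

chat_only = [("gen_ai.operation.name", "chat")]
root_only = [("parent_span_id", "")]  # empty value: spans with no parent

print(in_scope(chat_span, ["support-bot"], chat_only, 100))  # True
print(in_scope(root_span, [], root_only, 100))               # True
print(in_scope(chat_span, [], root_only, 100))               # False
```

The last two calls show the trace-level pattern: a rule on parent_span_id with an empty value keeps the root span and drops the child span.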

Output and acceptance criteria#

[Screenshot: Evaluator configuration showing structured output response format and acceptance criteria]
  • Structured Output — the Response Format the judge must return. Pick one and edit the JSON schema:

    • Category — one label from a defined set (for example, ON_TOPIC / OFF_TOPIC / NEUTRAL).
    • Boolean — a true/false result.
    • Score — a number within a defined range.

    Include a reasoning field in the schema so the judge's reasoning is stored alongside the result.

  • Acceptance Criteria — how a result maps to Pass or Fail:

    • For Category, list the passing categories (pass_values).
    • For Boolean, choose whether true or false is the passing value (pass_when).
    • For Score, set the passing range with a minimum and/or maximum threshold.
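
As an illustration of how acceptance criteria turn a judge result into a verdict, here is a minimal sketch. The names pass_values and pass_when come from the settings above. The `verdict` function, the `minimum`/`maximum` parameter names, the inclusive bounds, and the shape of the sample judge result are hypothetical assumptions.

```python
def verdict(output_type: str, value, *, pass_values=(), pass_when=True,
            minimum=None, maximum=None) -> str:
    """Map a judge result to "pass" or "fail" for one response format."""
    if output_type == "category":
        passed = value in pass_values
    elif output_type == "boolean":
        passed = value == pass_when
    elif output_type == "score":
        # Assumption: thresholds are inclusive; an unset bound is ignored.
        passed = ((minimum is None or value >= minimum)
                  and (maximum is None or value <= maximum))
    else:
        raise ValueError(f"unknown output type: {output_type}")
    return "pass" if passed else "fail"


# Hypothetical judge result for a Category evaluator with a reasoning field.
judge_result = {"category": "NEUTRAL",
                "reasoning": "Greeting only; no off-topic request."}

print(verdict("category", judge_result["category"],
              pass_values=("ON_TOPIC", "NEUTRAL")))      # pass
print(verdict("boolean", False, pass_when=False))        # pass
print(verdict("score", 0.4, minimum=0.7))                # fail
```

The second call shows why pass_when matters: for a check such as "did the assistant fail to answer?", false is the desirable result and should map to Pass.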

When you’re finished, click Save and Publish. Middleware starts running the evaluator on new matching spans. (You can also save it as a draft and publish later.)

Supported judge models#

The judge can be any of these models, all run on Middleware’s own provider accounts:

  • OpenAI: gpt-4o, gpt-4o-mini, gpt-4-turbo, gpt-4.1, gpt-4.1-mini, o1, o1-mini, o3, o3-mini, o4-mini.
  • Anthropic: claude-opus-4-6, claude-opus-4-5, claude-sonnet-4-6, claude-sonnet-4-5, claude-haiku-4-5, claude-3-5-sonnet-latest, claude-3-5-haiku-latest.

View results#

Once an evaluator is published, results attach to the spans it scores. Open a trace in LLM Traces, select a span, and open the Evaluation tab to see the verdict, value, and reasoning.

[Screenshot: Span Evaluation tab showing an evaluation result with its assessment and the judge's reasoning]
[Screenshot: Raw evaluation attributes recorded on a span, including name, score value, verdict, and reasoning]

Data exported#

Every evaluation is exported to Middleware in two forms, both following the OpenTelemetry GenAI semantic conventions (the gen_ai.* namespace). Client-side SDK evaluations export the same way, which is why both look identical in the UI.

Log record#

Each result is emitted as one OTLP log record, correlated to the evaluated span by trace_id and span_id. The log body is a JSON object with the full result (eval_name, score_value, score_label, verdict, explanation, judge_provider, judge_model, and judge token usage/cost_usd when available). The same fields are also set as log attributes so you can search and filter on them:

| Attribute | Meaning |
| --- | --- |
| gen_ai.evaluation.name | The evaluator’s name |
| gen_ai.evaluation.score.value | The numeric score value |
| gen_ai.evaluation.score.label | The label or category |
| gen_ai.evaluation.verdict | pass or fail |
| gen_ai.evaluation.explanation | The judge’s reasoning |
| gen_ai.evaluation.cost.usd | Estimated judge cost, when token usage is known |
| eval.target.span_id / eval.target.trace_id | The span and trace that were evaluated |
| eval.model.provider / eval.model.name | The judge model used |
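
To show how these attributes might be used once exported, the sketch below filters a handful of evaluation log records for failures of one evaluator and lists the spans they point at. The records are made-up sample data represented as plain dictionaries, and `failing_span_ids` is a hypothetical helper; only the attribute keys are taken from the table above.

```python
# Made-up evaluation log attributes, keyed as in the table above.
records = [
    {"gen_ai.evaluation.name": "toxicity",
     "gen_ai.evaluation.verdict": "pass",
     "gen_ai.evaluation.score.label": "Not Toxic",
     "eval.target.span_id": "span-1", "eval.target.trace_id": "trace-a"},
    {"gen_ai.evaluation.name": "toxicity",
     "gen_ai.evaluation.verdict": "fail",
     "gen_ai.evaluation.score.label": "Toxic",
     "gen_ai.evaluation.explanation": "Contains an insult aimed at the user.",
     "eval.target.span_id": "span-2", "eval.target.trace_id": "trace-a"},
    {"gen_ai.evaluation.name": "topic-relevancy",
     "gen_ai.evaluation.verdict": "fail",
     "gen_ai.evaluation.score.label": "OFF_TOPIC",
     "eval.target.span_id": "span-3", "eval.target.trace_id": "trace-b"},
]


def failing_span_ids(records: list, evaluator_name: str) -> list:
    """Span IDs whose evaluation by `evaluator_name` has a fail verdict."""
    return [r["eval.target.span_id"] for r in records
            if r.get("gen_ai.evaluation.name") == evaluator_name
            and r.get("gen_ai.evaluation.verdict") == "fail"]


print(failing_span_ids(records, "toxicity"))  # ['span-2']
```

The same two conditions (evaluator name plus verdict) are what you would express as a log search filter to find failing evaluations.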

Metrics#

Each result also produces these gauge metrics, so you can chart pass rates over time and alert when quality drops:

| Metric | Unit | Meaning |
| --- | --- | --- |
| gen_ai.evaluations.count | 1 | One datapoint per evaluation run |
| gen_ai.evaluations.score | 1 | The numeric score. For boolean evals, 1 when the raw result is true; for categorical, 1 when the result passes the acceptance criteria, else 0; for score evals, the raw number |
| gen_ai.evaluations.outcome | 1 | One datapoint per evaluation, tagged with an outcome attribute of pass, fail, or error |
| gen_ai.evaluations.cost.usd | USD | Estimated judge cost |

Each metric carries name, label, verdict, service, model, and provider attributes for grouping.
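
The sketch below restates the gen_ai.evaluations.score rule from the table as code and shows one way a pass rate could be derived from outcome datapoints. Both helpers are hypothetical. Excluding error outcomes from the pass-rate denominator is an assumption, so adjust it to whatever definition your dashboards use.

```python
def score_metric(output_type: str, raw, passed: bool) -> float:
    """Value reported for gen_ai.evaluations.score, following the table above."""
    if output_type == "boolean":
        return 1.0 if raw is True else 0.0   # based on the raw result
    if output_type == "category":
        return 1.0 if passed else 0.0        # based on acceptance criteria
    return float(raw)                        # score evals: the raw number


def pass_rate(outcomes: list):
    """Share of pass among pass/fail outcomes; None if there are none.

    Assumption: "error" outcomes are left out of the denominator.
    """
    decided = [o for o in outcomes if o in ("pass", "fail")]
    if not decided:
        return None
    return decided.count("pass") / len(decided)


print(score_metric("boolean", False, passed=True))    # 0.0
print(score_metric("category", "NEUTRAL", passed=True))  # 1.0
print(score_metric("score", 0.82, passed=True))       # 0.82
print(round(pass_rate(["pass", "pass", "fail", "error"]), 3))  # 0.667
```

As the table reads, a boolean evaluator's score follows the raw result rather than the verdict, so an evaluator whose passing value is false reports a score of 0 on passing spans. For pass-rate charts and alerts, the outcome metric is therefore the safer signal.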

Troubleshooting#

  • No results on spans — Confirm the evaluator is published (not a draft), that its service name and Evaluate On rules actually match your spans (an overly strict rule scopes everything out), and that the sampling rate isn’t so low that few spans are picked. Sampling below 100% means not every span gets a result.
  • Judge errors / no score — Confirm the selected judge model is one of the supported OpenAI/Anthropic models listed above. Also check the Tokens section: if your account has used its included token allowance, evaluations stop running until it’s topped up.
  • Empty input/output in the prompt — The judge reads {{span.input}} / {{span.output}} from the span’s GenAI attributes. Spans without GenAI input/output messages won’t give the judge anything to evaluate.

Need help? Contact the Middleware support team at [email protected] or join our Slack community.