Middleware LLM SDK - Setup Guide for Python
middleware-llmobs is Middleware’s first-party LLM Observability SDK for Python, built on OpenInference and OpenTelemetry. It does two things:
- Tracing: a one-line `register()` configures an OpenTelemetry `TracerProvider` that auto-instruments your AI/ML libraries and exports OpenInference LLM spans to your Middleware collector.
- Evaluations: attach scores, verdicts, and LLM-as-judge results to your spans and traces, exported as OTel logs and metrics so you can monitor quality and cost alongside performance.
Transport: The SDK exports over OTLP/HTTP (HTTP/protobuf) only. gRPC is intentionally disabled — requesting it raises NotImplementedError.
Looking for end-to-end, copy-paste examples? See the Cookbooks for recipes covering tracing, tool/agent calls, RAG, sessions, and evaluations.
Before you begin
You’ll need:
- A Python LLM app you can modify. Python `>=3.10, <3.15` is required.
- Your Middleware UID (for the tenant ingest endpoint) and Middleware API key (for the `Authorization` header).
- One or more OpenInference instrumentation packages for the providers and frameworks you use. The SDK has zero LLM-provider dependencies: it never imports `openai`, `anthropic`, etc. You bring your own provider client and install the matching instrumentation (see Step 2 below).
1. Install the SDK

    pip install middleware-llmobs

2. Install instrumentation for your models and frameworks
Auto-instrumentation is opt-in per library: register(auto_instrument=True) only traces a library when its OpenInference instrumentation package is installed. Install one package for each LLM provider, vector store, and agent framework you want traced.
The packages follow the naming pattern `openinference-instrumentation-<library>`. For example, to trace OpenAI calls made from a LangChain app, install both:

    # LLM provider
    pip install openinference-instrumentation-openai

    # Agent / orchestration framework
    pip install openinference-instrumentation-langchain

A few more examples (Anthropic, Bedrock, LlamaIndex, CrewAI):

    pip install openinference-instrumentation-anthropic
    pip install openinference-instrumentation-bedrock
    pip install openinference-instrumentation-llama-index
    pip install openinference-instrumentation-crewai

OpenInference ships instrumentation for many more LLM providers (Anthropic, Bedrock, VertexAI, Gemini, Mistral, Groq, LiteLLM, …) and agent frameworks (LlamaIndex, CrewAI, Haystack, DSPy, AutoGen, Google ADK, OpenAI Agents, Pydantic AI, smolagents, MCP, …). For the full, up-to-date list of packages, see the OpenInference instrumentation directory.
3. Configure your credentials
register() reads standard OTLP environment variables. The cleanest setup is to export your endpoint and key, and pass the service name in code:

    export OTEL_EXPORTER_OTLP_ENDPOINT="https://<MW_UID>.middleware.io:443"
    export OTEL_EXPORTER_OTLP_HEADERS="Authorization=<MW_API_KEY>,X-Trace-Source=openinference"
    export OTEL_SERVICE_NAME="my-llm-app"

`Authorization` carries your Middleware API key. `X-Trace-Source=openinference` tells Middleware these spans come from the OpenInference-based SDK so it routes them correctly. The paths `/v1/traces`, `/v1/logs`, and `/v1/metrics` are appended to the endpoint automatically, so pass only the base URL.
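To make the base-URL rule concrete, here is a minimal sketch of how per-signal paths are typically appended to an OTLP/HTTP base endpoint. `signal_urls` and the `example-uid` host are hypothetical names used for illustration; they are not part of the SDK, and the joining logic inside `register()` may differ.

```python
def signal_urls(base_endpoint: str) -> dict[str, str]:
    """Illustrative only: derive per-signal OTLP/HTTP URLs from a base endpoint."""
    base = base_endpoint.rstrip("/")  # avoid a double slash when joining
    return {
        "traces": f"{base}/v1/traces",
        "logs": f"{base}/v1/logs",
        "metrics": f"{base}/v1/metrics",
    }


# Hypothetical tenant host, used only to show the shape of the result.
urls = signal_urls("https://example-uid.middleware.io:443")
print(urls["traces"])
```

If the endpoint you export already ends in `/v1/traces`, a scheme like this would produce a doubled path, which is why only the base URL should be configured.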
4. Register and auto-instrument
Call register() once, as early as possible in your app’s startup, before you import or call your LLM libraries:

    from middleware.llmobs import register

    providers = register(auto_instrument=True)  # batch + HTTP by default
    # providers.tracer / providers.logger / providers.meter: hold onto this to flush at shutdown.

That single call:
- Builds an OpenTelemetry tracer, logger, and meter provider configured for your Middleware endpoint, and installs them as the OTel globals.
- With `auto_instrument=True`, discovers every installed OpenInference instrumentor and attaches it with GenAI semantic conventions enabled (so spans carry `gen_ai.*` attributes).
Now every instrumented call is traced automatically:

    import openai

    client = openai.OpenAI()
    response = client.chat.completions.create(  # automatically traced
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": "Hello!"}],
    )

Production tip: `register()` uses a batch span processor by default (`batch=True`), which exports spans in the background so your application isn’t blocked. Keep batching on in production; use `batch=False` only for quick local verification.
Explicit configuration
You can pass endpoint, headers, and service name directly instead of using environment variables:

    from middleware.llmobs import register

    providers = register(
        endpoint="https://<MW_UID>.middleware.io:443",
        headers={"Authorization": "<MW_API_KEY>", "X-Trace-Source": "openinference"},
        service_name="my-llm-app",
        auto_instrument=True,
    )

Never hardcode your Middleware API key in committed code. Prefer the `OTEL_EXPORTER_OTLP_HEADERS` environment variable.
Manual instrumentor wiring
If you’d rather attach instrumentors yourself instead of using auto_instrument, register without it and instrument explicitly. This is useful when you want to enable only specific libraries or pass a custom TraceConfig:

    from middleware.llmobs import register
    from openinference.instrumentation import TraceConfig
    from openinference.instrumentation.openai import OpenAIInstrumentor

    providers = register(service_name="my-llm-app")

    OpenAIInstrumentor().instrument(
        tracer_provider=providers.tracer,
        config=TraceConfig(enable_genai_semconv=True),
    )

Configuration reference
Each setting can be passed to register() or set via an environment variable (the argument takes precedence):
| `register()` argument | Environment variable | Purpose |
|---|---|---|
| `endpoint` | `OTEL_EXPORTER_OTLP_ENDPOINT` | Collector endpoint, e.g. `https://<MW_UID>.middleware.io:443`. |
| `headers` | `OTEL_EXPORTER_OTLP_HEADERS` | Export headers: the `Authorization` key, plus `X-Trace-Source=openinference`. As a dict `{"Authorization": "<key>", "X-Trace-Source": "openinference"}` or env string `Authorization=<key>,X-Trace-Source=openinference`. |
| `service_name` | `OTEL_SERVICE_NAME` | Service identity (defaults to `"default"`). |
| `project_name` | `MW_PROJECT_NAME` | Optional project name (defaults to the service name). |
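The two `headers` forms in the table carry the same information. As a rough sketch of that equivalence, the following converts the env-string form into the dict form. `parse_otlp_headers` is a hypothetical helper written for this guide, not an SDK function, and the SDK's own parsing may handle edge cases differently.

```python
def parse_otlp_headers(raw: str) -> dict[str, str]:
    """Illustrative only: split 'k1=v1,k2=v2' into a dict."""
    headers: dict[str, str] = {}
    for pair in raw.split(","):
        if not pair.strip():
            continue  # tolerate empty entries such as a trailing comma
        # Split on the first '=' only, so values may themselves contain '='.
        key, sep, value = pair.partition("=")
        if not sep:
            raise ValueError(f"malformed header entry: {pair!r}")
        headers[key.strip()] = value.strip()
    return headers


env_value = "Authorization=<MW_API_KEY>,X-Trace-Source=openinference"
print(parse_otlp_headers(env_value))
```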
Other flags accepted by register():
- `auto_instrument` (bool): discover and attach all installed OpenInference instrumentors.
- `batch` (bool, default `True`): use a batch span processor; set `False` for a simple processor during local debugging.
- `verbose` (bool, default `True`): print a configuration summary at startup.
- `set_global_tracer_provider` / `set_global_logger_provider` / `set_global_meter_provider`: each defaults to `True`; set `False` to manage the OTel globals yourself.
- Any extra keyword arguments pass through to the underlying `TracerProvider` (for example a custom `resource=`, `sampler=`, or `id_generator=`).
5. Naming traces and enriching spans
Auto-instrumentation adds an LLM child span for each provider call. Wrap your call in a parent span to give the whole trace a clear name:

    from opentelemetry import trace

    tracer = trace.get_tracer(__name__)

    # `client` is the OpenAI client created in Step 4.
    with tracer.start_as_current_span("chat"):
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": "Summarise OpenTelemetry in one sentence."}],
        )

Session and user context
The SDK re-exports OpenInference context managers to add session, user, and metadata context to every span created inside them — useful for grouping a multi-turn conversation or attributing traffic to a user:

    from middleware.llmobs import using_session, using_user

    with using_session(session_id="abc-123"), using_user(user_id="user-456"):
        response = client.chat.completions.create(...)

Also available: `using_attributes`, `using_metadata`, `using_tags`, `using_prompt_template`, and `suppress_tracing` (a context manager that temporarily disables tracing).
Span decorators for pipeline steps
The SDK ships decorators that open a span around a function and tag it with the right GenAI operation. They work on both sync and async functions, and can be used bare (@task) or with arguments (@task(name="...")):
- `@task`: a generic step in your pipeline.
- `@retriever`: a retrieval (RAG) step. Call `annotate_rag()` inside it to record the query and the documents you got back.
- `@embedding`: an embedding call.

    from middleware.llmobs import task, retriever, embedding, annotate_rag

    # `vector_db` and `embedding_client` stand for your own clients.

    @task
    def build_prompt(question: str) -> str:
        return f"Answer concisely: {question}"

    @retriever
    def get_relevant_docs(question: str):
        docs = [
            {"id": d.id, "score": d.score, "text": d.text}
            for d in vector_db.search(question)
        ]
        annotate_rag(query=question, documents=docs)
        return docs

    @embedding
    def embed(text: str):
        return embedding_client.create(model="text-embedding-3-small", input=text)

You can also call `annotate_rag(query=..., documents=...)` inside your own `start_as_current_span` block when a decorator doesn’t fit. Documents can be plain dicts or any object exposing `id` / `score` / `text` (or `content`) / `name` / `metadata`.
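The "dict or object" rule for documents can be pictured with a small normalizer. This is a sketch of the idea only: `normalize_document` is a hypothetical helper, and the assumption that `text` takes priority over `content` and that missing fields are dropped is mine, not something the SDK documents.

```python
from types import SimpleNamespace
from typing import Any


def normalize_document(doc: Any) -> dict[str, Any]:
    """Illustrative only: reduce a dict or an object to the document fields."""
    def get(name: str) -> Any:
        if isinstance(doc, dict):
            return doc.get(name)
        return getattr(doc, name, None)

    text = get("text")
    if text is None:
        text = get("content")  # assumption: fall back to `content` when `text` is absent
    fields = {
        "id": get("id"),
        "score": get("score"),
        "text": text,
        "name": get("name"),
        "metadata": get("metadata"),
    }
    # Assumption: fields the document does not provide are simply omitted.
    return {key: value for key, value in fields.items() if value is not None}


as_dict = normalize_document({"id": "doc-1", "score": 0.92, "text": "OTel basics"})
as_obj = normalize_document(SimpleNamespace(id="doc-2", content="Tracing 101"))
print(as_dict)
print(as_obj)
```

In practice this means search results from most vector stores can be passed to `annotate_rag` without first converting them to dicts, as long as they expose these attribute names.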
6. Evaluations
Evaluations attach a score or verdict to a span or trace, and ship as OTel logs and gauge metrics (gen_ai.evaluations.*) over the same HTTP transport. There are three ways in, from lowest to highest level.
Submit a value you already computed
When you’ve already computed a result, submit_evaluation records it. By default it auto-binds to the active span, so call it inside your start_as_current_span block:

    from middleware.llmobs import submit_evaluation

    with tracer.start_as_current_span("chat"):
        response = client.chat.completions.create(...)

        submit_evaluation(
            label="response_not_empty",  # must match ^[a-zA-Z][a-zA-Z0-9_]*$
            value=True,                  # int/float (score), bool (boolean), or str (categorical)
            metric_type="boolean",       # inferred from the value if omitted
            assessment="pass",           # optional: "pass" | "fail"
            reasoning="The model returned a non-empty answer.",
            tags={"feature": "chat"},
        )

`metric_type` is inferred from the value when omitted: bool → boolean, int/float → score, str → categorical. To attach an eval to a specific span instead of the active one, pass `span_id=` and `trace_id=` together (as hex strings), or use `join_on_tag=("key", "value")`. Use `export_current_span()` to capture the current IDs for later submission, and `submit_evaluation_error(label=..., error=...)` to record a failed evaluation so it stays visible in dashboards.
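The inference rule and the label pattern are easy to check locally before submitting. The sketch below mirrors the mapping described above; `infer_metric_type` and `is_valid_label` are hypothetical helpers for illustration, not SDK functions.

```python
import re

# Pattern quoted in the submit_evaluation example above.
LABEL_PATTERN = re.compile(r"^[a-zA-Z][a-zA-Z0-9_]*$")


def infer_metric_type(value: object) -> str:
    """Illustrative only: map a Python value to an evaluation metric type."""
    # bool must be tested before int, because bool is a subclass of int in Python.
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, (int, float)):
        return "score"
    if isinstance(value, str):
        return "categorical"
    raise TypeError(f"unsupported evaluation value: {type(value).__name__}")


def is_valid_label(label: str) -> bool:
    """Return True when the label starts with a letter and uses only letters, digits, and underscores."""
    return LABEL_PATTERN.match(label) is not None


print(infer_metric_type(True), infer_metric_type(0.7), infer_metric_type("good"))
print(is_valid_label("response_not_empty"), is_valid_label("1bad"))
```

The ordering of the `isinstance` checks matters: testing `int` first would classify `True` as a score rather than a boolean.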
LLM-as-judge
For quality checks scored by a model, build an LLMJudge. Because the SDK never imports a provider, you write a thin client adapter that calls your model and returns its text response. Use format_schema_for_provider to turn the SDK’s canonical JSON schema into your provider’s request kwargs:

    from openai import OpenAI
    from middleware.llmobs import (
        LLMJudge,
        BooleanStructuredOutput,
        EvaluatorContext,
        evaluate_and_submit,
        format_schema_for_provider,
    )

    openai_client = OpenAI()

    # Thin adapter: calls the model and returns its text response.
    def openai_judge_client(messages, model, json_schema=None, model_params=None) -> str:
        kwargs = {"model": model, "messages": messages}
        if model_params:
            kwargs.update(model_params)
        if json_schema:
            kwargs.update(format_schema_for_provider(json_schema, "openai"))
        resp = openai_client.chat.completions.create(**kwargs)
        return resp.choices[0].message.content or ""

    toxicity_judge = LLMJudge(
        client=openai_judge_client,
        model="gpt-4o-mini",
        name="toxicity",
        system_prompt="You are a strict content-safety classifier. Respond ONLY in the requested JSON shape.",
        user_prompt="Classify this message as toxic (true) or not.\n\nMessage:\n{{output}}",
        structured_output=BooleanStructuredOutput(
            description="True if the message is toxic.",
            reasoning=True,
            pass_when=False,  # the eval PASSES when the value is False (= not toxic)
        ),
        model_params={"temperature": 0.0, "max_tokens": 200},
    )

    # `customer_msg` and `agent_reply` stand for strings from your own application.
    ctx = EvaluatorContext(input=customer_msg, output=agent_reply)
    result = evaluate_and_submit(toxicity_judge, ctx)  # runs the judge AND submits in one call

The `user_prompt` uses `{{field.path}}` placeholders rendered from the `EvaluatorContext`: `{{input}}`, `{{output}}`, `{{expected_output}}`, `{{metadata.foo}}`, and so on. `format_schema_for_provider` supports `"openai"`, `"azure_openai"`, `"anthropic"`, `"vertexai"`, and `"bedrock"`.
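To show how `{{field.path}}` placeholders resolve against context fields, here is a minimal stand-alone renderer. It is a sketch under assumptions: `render_prompt` is a hypothetical function, the context is a plain dict rather than an `EvaluatorContext`, and rendering unknown fields as empty text is my choice, not documented SDK behaviour.

```python
import re
from typing import Any

# Matches {{name}} and dotted paths such as {{metadata.lang}}.
PLACEHOLDER = re.compile(r"\{\{\s*([A-Za-z_][\w.]*)\s*\}\}")


def render_prompt(template: str, context: dict[str, Any]) -> str:
    """Illustrative only: replace {{a.b}} with context['a']['b']."""
    def lookup(match: re.Match) -> str:
        value: Any = context
        for part in match.group(1).split("."):
            if not isinstance(value, dict) or part not in value:
                return ""  # assumption: unknown fields render as empty text
            value = value[part]
        return str(value)

    return PLACEHOLDER.sub(lookup, template)


context = {
    "input": "Where is my order?",
    "output": "It ships today.",
    "metadata": {"lang": "en"},
}
rendered = render_prompt("Q: {{input}}\nA: {{output}} ({{metadata.lang}})", context)
print(rendered)
```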
Choose a structured-output type to drive both the JSON schema and the pass/fail logic:
| Type | Judge returns | Pass logic |
|---|---|---|
| `BooleanStructuredOutput(description, reasoning=False, pass_when=None)` | `true` / `false` | `pass_when` sets which value passes. |
| `ScoreStructuredOutput(description, min_score, max_score, min_threshold=None, max_threshold=None)` | a number | thresholds define the passing range. |
| `CategoricalStructuredOutput(categories={value: desc, ...}, pass_values=None)` | one category | `pass_values` lists the passing ones. |
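The pass logic in the table can be read as three small predicates. The functions below are a hypothetical sketch of that logic, not the SDK's implementation; in particular, treating thresholds as inclusive and returning no verdict when no pass rule is configured are assumptions.

```python
from typing import Optional


def boolean_assessment(value: bool, pass_when: Optional[bool]) -> Optional[str]:
    """Pass when the judged value equals pass_when."""
    if pass_when is None:
        return None  # assumption: no verdict without a pass rule
    return "pass" if value == pass_when else "fail"


def score_assessment(
    value: float,
    min_threshold: Optional[float] = None,
    max_threshold: Optional[float] = None,
) -> Optional[str]:
    """Pass when the score lies inside the configured range."""
    if min_threshold is None and max_threshold is None:
        return None
    # Assumption: both bounds are inclusive.
    if min_threshold is not None and value < min_threshold:
        return "fail"
    if max_threshold is not None and value > max_threshold:
        return "fail"
    return "pass"


def categorical_assessment(value: str, pass_values: Optional[list[str]]) -> Optional[str]:
    """Pass when the chosen category is one of the passing values."""
    if pass_values is None:
        return None
    return "pass" if value in pass_values else "fail"


print(boolean_assessment(False, pass_when=False))   # not toxic, so the check passes
print(score_assessment(0.5, min_threshold=0.8))
print(categorical_assessment("neutral", ["positive", "neutral"]))
```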
For async judges, use AsyncLLMJudge with an async def client adapter and await aevaluate_and_submit(judge, ctx). Running several async judges with asyncio.gather(...) on the same span attaches multiple evaluations to one trace.
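The concurrency pattern itself is plain `asyncio`. In the sketch below, two stub coroutines stand in for `aevaluate_and_submit(judge, ctx)` calls so the example runs without the SDK or a model provider; the stub names and their return shapes are made up for illustration.

```python
import asyncio


async def toxicity_stub(output: str) -> dict:
    await asyncio.sleep(0)  # stands in for an awaited model call
    return {"label": "toxicity", "value": False}


async def relevance_stub(output: str) -> dict:
    await asyncio.sleep(0)
    return {"label": "relevance", "value": 0.9}


async def run_judges(output: str) -> list[dict]:
    # gather schedules both coroutines concurrently and returns
    # their results in the order the awaitables were passed in.
    return await asyncio.gather(toxicity_stub(output), relevance_stub(output))


results = asyncio.run(run_judges("Your order ships today."))
print([r["label"] for r in results])
```

With real judges, each awaited call would spend most of its time waiting on the model, so gathering them keeps total latency close to that of the slowest judge.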
Custom evaluators
For non-LLM checks (regex, JSON-validity, similarity, thresholds), use the @evaluator decorator. It captures exceptions into the result instead of propagating them, and normalizes the return value:

    from middleware.llmobs import evaluator, EvaluatorContext, evaluate_and_submit

    @evaluator  # or @evaluator(name="custom_name")
    def has_answer(ctx: EvaluatorContext):
        return bool(ctx.output.strip())  # bool→boolean, int/float→score, str→categorical, None→skip

    # `question` and `answer` stand for strings from your own application.
    evaluate_and_submit(has_answer, EvaluatorContext(input=question, output=answer))

For stateful or configurable evaluators, subclass `BaseEvaluator` (or `AsyncBaseEvaluator`) and implement `evaluate(self, ctx)`:

    from middleware.llmobs import BaseEvaluator, EvaluatorResult

    class SimilarityEvaluator(BaseEvaluator):
        def __init__(self, threshold=0.8):
            super().__init__(name="similarity")
            self.threshold = threshold

        def evaluate(self, ctx):
            # `cosine_sim` stands for your own similarity function.
            score = cosine_sim(ctx.output, ctx.expected_output)
            return EvaluatorResult(
                value=score,
                assessment="pass" if score >= self.threshold else "fail",
            )

`evaluate(...)` may run concurrently across threads or tasks, so don’t mutate instance attributes inside it; use locals. Use `@async_evaluator` (with `aevaluate_and_submit`) for `async def` evaluators.
EvaluatorContext carries the fields your evaluators read: input, output, expected_output, retrieved_contexts, metadata, tags, span_id, and trace_id.
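The "capture exceptions and normalize the return value" behaviour described for `@evaluator` can be sketched with a plain decorator. Everything here is hypothetical: `capturing_evaluator`, the result dict layout, and the `status` values are invented to illustrate the pattern and do not reflect the SDK's actual `EvaluatorResult` type.

```python
import functools
from typing import Any, Callable


def capturing_evaluator(func: Callable[[Any], Any]) -> Callable[[Any], dict]:
    """Illustrative only: capture exceptions and normalize the return value."""
    @functools.wraps(func)
    def wrapper(ctx: Any) -> dict:
        name = func.__name__
        try:
            value = func(ctx)
        except Exception as exc:  # report the failure instead of raising
            return {"name": name, "status": "error", "error": repr(exc)}
        if value is None:
            return {"name": name, "status": "skipped"}
        if isinstance(value, bool):  # check bool before int
            metric_type = "boolean"
        elif isinstance(value, (int, float)):
            metric_type = "score"
        elif isinstance(value, str):
            metric_type = "categorical"
        else:
            return {"name": name, "status": "error",
                    "error": f"unsupported type {type(value).__name__}"}
        return {"name": name, "status": "ok", "metric_type": metric_type, "value": value}
    return wrapper


@capturing_evaluator
def has_answer(ctx: dict) -> bool:
    return bool(ctx["output"].strip())


print(has_answer({"output": "42"}))
print(has_answer({}))  # the KeyError is captured in the result, not raised
```

Capturing failures this way keeps one broken check from aborting a batch of evaluations, at the cost of having to inspect each result for an error status.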
7. Flush before short-lived processes exit
For scripts and serverless functions, drain the processors before the process ends, or spans and evaluations may be lost:

    from middleware.llmobs import flush_evaluations

    providers.tracer.force_flush()  # spans
    flush_evaluations()             # eval logs + metrics (if you submitted any)

Long-running servers don’t need this on every request: flushing is handled by the batch processor and at shutdown.
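For scripts, one way to avoid forgetting the flush is to register it with the standard-library `atexit` module. This is a sketch of that pattern, not an SDK feature: `flush_all` and `register_exit_flush` are hypothetical helpers, and stand-in callables replace `providers.tracer.force_flush` and `flush_evaluations` so the example runs on its own.

```python
import atexit
from typing import Callable


def flush_all(flushers: list[Callable[[], object]]) -> list[str]:
    """Run every flusher; collect errors instead of stopping at the first failure."""
    errors = []
    for flush in flushers:
        try:
            flush()
        except Exception as exc:
            errors.append(f"{getattr(flush, '__name__', 'flusher')}: {exc!r}")
    return errors


def register_exit_flush(flushers: list[Callable[[], object]]) -> None:
    """Ask the interpreter to run flush_all(flushers) during normal shutdown."""
    atexit.register(flush_all, flushers)


# In a real script this list would hold providers.tracer.force_flush and
# flush_evaluations; stand-in callables are used here.
calls: list[str] = []
register_exit_flush([lambda: calls.append("spans"), lambda: calls.append("evals")])
```

Note that `atexit` handlers run only on normal interpreter exit. In serverless handlers, where the process may be frozen rather than exited, calling the flush functions explicitly at the end of each invocation is the safer choice.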
View your data in Middleware
Run your app and trigger at least one LLM request. Open the LLM Observability section in Middleware to see your traces, spans, and evaluation results. Drill into a trace to inspect prompts, responses, token usage, and any evaluations attached to it.
Troubleshooting & Common Pitfalls
- Nothing shows up in the UI → Confirm `OTEL_EXPORTER_OTLP_ENDPOINT` and `OTEL_EXPORTER_OTLP_HEADERS` are correct, that `register()` runs before your LLM calls, and that you actually execute a request after startup.
- No spans from a provider or framework → `auto_instrument=True` only traces libraries whose `openinference-instrumentation-*` package is installed. Install the matching package (Step 2) for each one you use.
- Spans missing from a short script → Call `providers.tracer.force_flush()` (and `flush_evaluations()` if you submitted evals) before the process exits.
- gRPC errors → This SDK is HTTP/protobuf only. Don’t pass `protocol="grpc"` or a gRPC-style endpoint.
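Several of these pitfalls can be caught before the app starts. The following preflight check is a sketch for setups that configure the SDK through environment variables; `check_otlp_env` is a hypothetical helper, and its checks are simple heuristics based on the rules in this guide rather than anything the SDK performs itself.

```python
import os
from urllib.parse import urlparse


def check_otlp_env(env=None) -> list[str]:
    """Return human-readable problems; an empty list means the basics look fine."""
    env = os.environ if env is None else env
    problems = []

    endpoint = env.get("OTEL_EXPORTER_OTLP_ENDPOINT", "")
    parsed = urlparse(endpoint)
    if not endpoint:
        problems.append("OTEL_EXPORTER_OTLP_ENDPOINT is not set")
    elif parsed.scheme not in ("http", "https"):
        problems.append("endpoint should be an http(s) URL, not a gRPC-style address")
    elif parsed.path.rstrip("/").startswith("/v1/"):
        problems.append("pass only the base URL; /v1/... paths are appended automatically")

    headers = env.get("OTEL_EXPORTER_OTLP_HEADERS", "")
    if "Authorization=" not in headers:
        problems.append("OTEL_EXPORTER_OTLP_HEADERS is missing Authorization=<key>")
    if "X-Trace-Source=openinference" not in headers:
        problems.append("OTEL_EXPORTER_OTLP_HEADERS is missing X-Trace-Source=openinference")
    return problems


for problem in check_otlp_env():
    print("config problem:", problem)
```

If you pass `endpoint` and `headers` to `register()` directly, the environment variables may legitimately be unset, so treat the output as a hint rather than a verdict.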
Need assistance or want to learn more about using the Middleware LLM SDK? Contact our support team by email or join our Slack channel.