Trace and evaluate a RAG pipeline

A RAG pipeline has three stages: routing, retrieval, and generation. Each stage fails in its own way: the model can route incorrectly, the retriever can return irrelevant context, or the generator can ignore the context it was given. If you evaluate only the final answer, you can't tell which stage failed, so you don't know what to fix.

Instrument each stage as its own span and evaluate it independently.

How the trace is structured

You want the three stages as three spans under one parent trace. Auto-instrumentation already covers the LLM calls, so you only add a decorator where it can't see the work:

  • The pipeline itself gets a @task(name="rag_pipeline") decorator. This opens the parent span that the stages below nest under, so they group into one trace.
  • Routing and generation are model calls, so register(auto_instrument=True) captures them as LLM spans automatically. No decorator needed.
  • Retrieval is a vector-store query, not a model call, so auto-instrumentation doesn't see it. Wrap it in @retriever and call annotate_rag() inside to record the query and the documents that came back.

The result is one trace with the intermediate states visible:

task: rag_pipeline
  ├── span: llm_routing     (auto)   input: user query        output: tool-call decision
  ├── span: rag_retrieval   (manual) input: user query        output: retrieved documents
  └── span: llm_generation  (auto)   input: query + context   output: final answer

@task is for non-model steps you want as a named span: the pipeline wrapper here, or any pre/post-processing. Model calls don't need it because auto-instrumentation already creates spans for them.

What you'll need

pip install middleware-llmobs openinference-instrumentation-openai openai chromadb

Export your Middleware endpoint, key, and OPENAI_API_KEY as in Trace an LLM application.

The recipe

from openai import OpenAI
from middleware.llmobs import register, task, retriever, annotate_rag
import chromadb

providers = register(service_name="rag-example", auto_instrument=True)
client = OpenAI()

# Seed a tiny vector store so the example runs end to end.
collection = chromadb.Client().get_or_create_collection("docs")
collection.add(
    documents=[
        "Middleware traces organize spans into parent-child workflows.",
        "Evaluations attach a score to a span: boolean, numeric, or categorical.",
        "The retriever span records the query and the documents it returned.",
    ],
    ids=["doc1", "doc2", "doc3"],
)

RETRIEVE_TOOL = {
    "type": "function",
    "function": {
        "name": "retrieve_context",
        "description": "Search the knowledge base for relevant information.",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
}

def route_query(query: str):
    """Decide whether this query needs a knowledge-base lookup (auto-instrumented LLM span)."""
    completion = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Call retrieve_context only if the query needs external knowledge."},
            {"role": "user", "content": query},
        ],
        tools=[RETRIEVE_TOOL],
    )
    return completion.choices[0].message

@retriever(name="rag_retrieval")
def retrieve_context(query: str, top_k: int = 3) -> list[dict]:
    """Retrieve documents and record them on the retrieval span."""
    results = collection.query(query_texts=[query], n_results=top_k)
    documents = [{"id": i, "text": d} for i, d in zip(results["ids"][0], results["documents"][0])]
    annotate_rag(query=query, documents=documents)   # query + documents land on this span
    return documents

def generate_answer(query: str, context: list[dict]) -> str:
    """Generate the answer from the retrieved context (auto-instrumented LLM span)."""
    context_str = "\n".join(f"- {d['text']}" for d in context)
    completion = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": f"Answer using only this context:\n{context_str}"},
            {"role": "user", "content": query},
        ],
    )
    return completion.choices[0].message.content

@task(name="rag_pipeline")   # root span; the calls below nest under it
def rag_pipeline(query: str) -> str:
    routing = route_query(query)
    context = retrieve_context(query) if routing.tool_calls else []
    return generate_answer(query, context)

print(rag_pipeline("How does Middleware handle tracing?"))
providers.tracer.force_flush()
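The recipe passes the user's original query to retrieve_context whenever the model makes a tool call. The model also fills in the tool's `query` argument, which may be a rewritten search query. If you would rather retrieve with that, one possible approach is to parse it from the tool call and fall back to the original query. `retrieval_query` is a hypothetical helper, and the `SimpleNamespace` objects below only stand in for the message object that the OpenAI client returns, so the sketch runs without an API key.

```python
import json
from types import SimpleNamespace


def retrieval_query(message, fallback: str) -> str:
    """Return the query the model passed to retrieve_context, else the fallback."""
    for call in message.tool_calls or []:  # tool_calls is None when no tool was called
        if call.function.name != "retrieve_context":
            continue
        try:
            # Tool-call arguments arrive as a JSON-encoded string.
            args = json.loads(call.function.arguments)
        except json.JSONDecodeError:
            continue  # malformed arguments: keep looking, then fall back
        query = args.get("query") if isinstance(args, dict) else None
        if isinstance(query, str) and query.strip():
            return query
    return fallback


# Stand-in for completion.choices[0].message (assumed shape, no network call).
message = SimpleNamespace(
    tool_calls=[
        SimpleNamespace(
            function=SimpleNamespace(
                name="retrieve_context",
                arguments='{"query": "Middleware tracing spans"}',
            )
        )
    ]
)
print(retrieval_query(message, fallback="How does Middleware handle tracing?"))
```

Whether the rewritten query retrieves better context than the original depends on your data, and the retrieval span makes that easy to compare because it records the query that was actually used.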

What you get

Open LLM Traces and you'll see each request as a rag_pipeline trace with three child spans: the two model calls (named by the OpenAI instrumentation) and your rag_retrieval span between them. The retrieval span (gen_ai.operation.name = "retrieval") records both the query and the retrieved documents, so you can inspect exactly which context was available when the answer was generated.

Figure: A rag_pipeline trace in Middleware, showing the parent task span over rag_retrieval and two ChatCompletion spans, with the retrieved documents recorded on the retrieval span.

annotate_rag documents can be plain dicts or any object exposing id / score / text (or content) / name / metadata. If a decorator doesn't fit your code, call annotate_rag(query=..., documents=...) inside your own start_as_current_span block instead.
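As an illustration of the object form, the sketch below defines a small dataclass exposing the attribute names listed above and fills it from a Chroma-style query result. `RetrievedDoc` and `docs_from_chroma` are hypothetical names, the sample distances are made-up values, and using Chroma's raw distance as `score` is an assumption: check what your evaluators expect before relying on it.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class RetrievedDoc:
    """Hypothetical document type exposing id / text / score / name / metadata."""
    id: str
    text: str
    score: Optional[float] = None
    name: Optional[str] = None
    metadata: dict = field(default_factory=dict)


def docs_from_chroma(results: dict) -> list:
    """Flatten the first query's hits from a Chroma-style query result."""
    ids = results["ids"][0]
    texts = results["documents"][0]
    # Distances and metadatas may be absent or None depending on what was requested.
    distances = (results.get("distances") or [[None] * len(ids)])[0]
    metadatas = (results.get("metadatas") or [[None] * len(ids)])[0]
    return [
        # Assumption: score carries the raw distance (lower means closer).
        RetrievedDoc(id=i, text=t, score=d, metadata=m or {})
        for i, t, d, m in zip(ids, texts, distances, metadatas)
    ]


# Sample result in the shape collection.query() returns; distances are illustrative.
sample = {
    "ids": [["doc1", "doc3"]],
    "documents": [[
        "Middleware traces organize spans into parent-child workflows.",
        "The retriever span records the query and the documents it returned.",
    ]],
    "distances": [[0.42, 0.77]],
    "metadatas": [[None, None]],
}
docs = docs_from_chroma(sample)
print([(d.id, d.score) for d in docs])
```

Recording a score alongside each document lets you see on the retrieval span how close each hit was, which helps when deciding whether a low retrieval score comes from ranking or from missing content.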

Evaluate each step

Because each stage is its own span, you attach an evaluation to the stage you want to measure. A bad final answer now points at a specific cause.

| Stage | What to check | A low score means |
| --- | --- | --- |
| Routing | Did the model route correctly (call RAG when needed, skip it when not)? | The routing prompt or tool schema does not clearly define when retrieval should occur. |
| Retrieval | Is the retrieved context relevant to and complete for the query? | Retrieval returned irrelevant or incomplete documents (embedding model, chunking, similarity threshold), or you need a higher top_k. |
| Generation | Is the answer grounded in the context, and did the model actually use it? | The model is generating content not supported by the retrieved context, or ignoring it and answering from parametric knowledge. |
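The routing and retrieval checks can also be computed deterministically when you have a small labeled set. The sketch below assumes you know, for each test query, whether retrieval should have happened and which document ids are relevant. `routing_correct`, `recall_at_k`, and the labeled cases are hypothetical, and the sketch only computes scores; it does not submit them to Middleware.

```python
def routing_correct(expected_retrieval: bool, made_tool_call: bool) -> bool:
    """Boolean routing score: did the model retrieve exactly when it should have?"""
    return expected_retrieval == made_tool_call


def recall_at_k(retrieved_ids: list, relevant_ids: set, k: int) -> float:
    """Numeric retrieval score: share of relevant documents found in the top k."""
    if not relevant_ids:
        return 1.0  # nothing to find; a convention chosen for this sketch
    hits = relevant_ids.intersection(retrieved_ids[:k])
    return len(hits) / len(relevant_ids)


# Hypothetical labeled cases: (query, retrieval expected?, model made a tool call?)
labeled = [
    ("How does Middleware handle tracing?", True, True),
    ("Say hello.", False, True),  # routed to retrieval when it was not needed
    ("What does the retriever span record?", True, True),
]
routing_accuracy = sum(routing_correct(e, m) for _, e, m in labeled) / len(labeled)
print(f"routing accuracy: {routing_accuracy:.2f}")

# doc1 and doc3 are relevant, but only doc1 appears in the top 2 results.
print(recall_at_k(["doc1", "doc2", "doc3"], {"doc1", "doc3"}, k=2))
```

A low recall that improves as k grows points at ranking or at a top_k that is too small. A recall that stays low at any k points at the embedding model or chunking.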

Two ways to run these checks in Middleware:

  • Server-side (UI): create an LLM-as-a-judge evaluator and scope it to specific spans using an Evaluate On filter (for example, gen_ai.operation.name = retrieval for the retrieval step). No code changes; Middleware scores matching spans automatically.
  • Client-side (SDK): score a step in code and submit the result with submit_evaluation, or run your own judge with LLMJudge. This is the right choice for groundedness checks that compare the answer against the exact documents you retrieved.
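As a rough idea of what a client-side score might look like before you reach for a judge model, here is a crude lexical baseline for groundedness: the fraction of the answer's content words that also appear in the retrieved documents. It is a stand-in rather than an LLM judge, it penalizes legitimate paraphrase, and `lexical_support` and the stopword list are hypothetical. The call that submits the score is left out because its signature is not shown on this page.

```python
import re

# Small illustrative stopword list; extend it for real use.
STOPWORDS = {"a", "an", "the", "and", "or", "of", "to", "in", "is", "are",
             "it", "that", "this", "with", "for", "on", "as", "by"}


def content_tokens(text: str) -> set:
    """Lowercased alphanumeric tokens with stopwords removed."""
    return {t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOPWORDS}


def lexical_support(answer: str, documents: list) -> float:
    """Fraction of the answer's content tokens found in any retrieved document."""
    answer_tokens = content_tokens(answer)
    if not answer_tokens:
        return 0.0
    context_tokens = set().union(*(content_tokens(d["text"]) for d in documents))
    return len(answer_tokens & context_tokens) / len(answer_tokens)


documents = [{"id": "doc1", "text": "Middleware traces organize spans into parent-child workflows."}]
print(lexical_support("Middleware traces organize spans into workflows.", documents))
print(lexical_support("Middleware stores logs in PostgreSQL.", documents))
```

The first answer is fully covered by the document and the second is mostly not, so the baseline separates them. Treat the number as a cheap first filter; a judge that reads the retrieved context is still the better check for subtle unsupported claims.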

For a groundedness judge that reads the retrieved context, see Evaluate with an LLM-as-judge.

Next steps