- Why did the agent choose that tool instead of this one?
- What context was actually passed to the LLM when it generated that response?
- Where is all the latency coming from - is it the model, the retrieval, or something else?
- The user got a wrong answer, but which step in the pipeline failed?
Follow along with code: This guide has a companion TypeScript project with runnable examples. Find it here.
SupportBot
Our sample support agent for this tutorial:
- Classifies incoming queries (order status vs. FAQ)
- Routes to the appropriate handler:
  - Order Status: Use a tool to look up order information, then summarize for the customer
  - FAQ: Search a knowledge base with embeddings, then generate an answer using RAG
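To make the flow concrete, here’s a rough sketch of the routing skeleton. This is illustrative, not the companion project’s exact code; the function names (classifyQuery, handleOrderStatus, handleFaq) are placeholders, and each handler is filled in over the rest of the tutorial.

```typescript
// Hypothetical skeleton of SupportBot's routing (names are illustrative;
// see the companion repo for the real implementation).
type QueryCategory = "order_status" | "faq";

async function classifyQuery(query: string): Promise<QueryCategory> {
  // Placeholder: in the real agent this is an LLM call (see "Tracing LLM Calls").
  return /\bord(er)?\b|ORD-\d+/i.test(query) ? "order_status" : "faq";
}

async function handleOrderStatus(query: string): Promise<string> {
  // Placeholder: tool lookup + summarization (see "Tracing Tool Calls").
  return `order-status answer for: ${query}`;
}

async function handleFaq(query: string): Promise<string> {
  // Placeholder: embed, retrieve, and generate (see "Tracing RAG Pipelines").
  return `faq answer for: ${query}`;
}

export async function handleSupportQuery(query: string): Promise<string> {
  const category = await classifyQuery(query);
  return category === "order_status" ? handleOrderStatus(query) : handleFaq(query);
}
```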
Setting Up Tracing
First, install the dependencies and configure OpenTelemetry to send traces to Phoenix.
Install Dependencies
Set up Phoenix Cloud
To send traces to Phoenix, sign up for a free account and space. Follow these instructions to configure Phoenix Cloud if you haven’t already.
Configure OpenTelemetry
Create an instrumentation.ts file that sends traces to Phoenix:
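A minimal sketch of what this file might look like, assuming the OpenTelemetry Node SDK and the OTLP/HTTP exporter, with the Phoenix Cloud endpoint and API key read from environment variables (the companion repo has the exact version):

```typescript
// instrumentation.ts - minimal sketch; see the companion repo for the exact setup.
import { NodeSDK } from "@opentelemetry/sdk-node";
import { OTLPTraceExporter } from "@opentelemetry/exporter-trace-otlp-proto";

// Assumes these env vars are set, e.g.
//   PHOENIX_COLLECTOR_ENDPOINT=https://app.phoenix.arize.com/v1/traces
//   PHOENIX_API_KEY=<your Phoenix Cloud API key>
const sdk = new NodeSDK({
  serviceName: "support-agent",
  traceExporter: new OTLPTraceExporter({
    url: process.env.PHOENIX_COLLECTOR_ENDPOINT,
    headers: { api_key: process.env.PHOENIX_API_KEY ?? "" },
  }),
});

sdk.start();
```

Import this file once at the very top of your entry point so the SDK is registered before any LLM calls run.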
Tracing LLM Calls
Every LLM call is a decision point. What prompt did the model receive? What did it output? How long did it take, and how many tokens did it use? Without tracing, you’re forced to build your own logging and still miss key data, which keeps you from full observability. With tracing, you get a complete record of every LLM interaction, including:
- input messages (system, user, and assistant prompts)
- LLM output
- model name, model provider
- invocation parameters
- token counts
- latency
Enabling this is a one-line change: experimental_telemetry: { isEnabled: true }. Add this to any generateText or embed call:
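For example, the classifier call might look like this (a sketch assuming the AI SDK’s generateText with the OpenAI provider; the model and prompt are illustrative):

```typescript
import { generateText } from "ai";
import { openai } from "@ai-sdk/openai";

const userQuery = "Where is my order ORD-12345?";

// With experimental_telemetry enabled, this call is recorded as an LLM span
// with its input messages, output, token counts, and latency.
const { text: category } = await generateText({
  model: openai("gpt-4o-mini"), // illustrative model choice
  system: "Classify the support query as 'order_status' or 'faq'. Reply with only the label.",
  prompt: userQuery,
  experimental_telemetry: {
    isEnabled: true,
    functionId: "classify-query", // optional: labels the span for easier filtering
  },
});
```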
Tracing Tool Calls
Tools allow your agent to interact with databases, APIs, and external systems. To understand how your tools are performing, you need to answer questions like:
- Did the LLM decide to call the right tool?
- Did it extract the parameters correctly?
- Did the tool return what you expected?
Tool calls are captured automatically as child spans, as long as experimental_telemetry is enabled. For an order status query, the trace contains:
- LLM Span: Model decides to call lookupOrderStatus
- Tool Span: Shows the tool name, input (orderId), and output
- LLM Span: Model summarizes the result
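Here’s a sketch of how a lookupOrderStatus tool could be wired up, assuming an AI SDK v4-style tool() helper with zod parameters (newer versions use inputSchema) and a stand-in lookup instead of a real database:

```typescript
import { generateText, tool } from "ai";
import { openai } from "@ai-sdk/openai";
import { z } from "zod";

const lookupOrderStatus = tool({
  description: "Look up the current status of an order by its ID.",
  parameters: z.object({
    orderId: z.string().describe("The order ID, e.g. ORD-12345"),
  }),
  execute: async ({ orderId }) => {
    // Stand-in for a real database or API lookup.
    return { orderId, status: "shipped", eta: "2 business days" };
  },
});

const { text } = await generateText({
  model: openai("gpt-4o-mini"), // illustrative model choice
  tools: { lookupOrderStatus },
  maxSteps: 2, // step 1: call the tool; step 2: summarize the result
  prompt: "What's the status of order ORD-12345?",
  experimental_telemetry: { isEnabled: true, functionId: "order-status" },
});
```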
Tracing RAG Pipelines
RAG pipelines can fail in many places. The embedding might not capture the query’s intent, the retrieval might return irrelevant documents, or the LLM might misuse good context. When a user gets a bad answer, which step failed? With tracing, you can see the full pipeline, including which documents were retrieved, what context was injected into the prompt, and how the LLM used it. You can pinpoint exactly where things went wrong. For RAG, trace both the embedding calls and the generation call. Each embed call becomes its own span:
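A sketch of the retrieval step, assuming the AI SDK’s embed and cosineSimilarity helpers and a tiny in-memory knowledge base (the companion project’s version will differ):

```typescript
import { embed, cosineSimilarity } from "ai";
import { openai } from "@ai-sdk/openai";

// Tiny illustrative knowledge base; in practice the document embeddings
// would be precomputed and stored.
const docs = [
  "Returns are accepted within 30 days of delivery.",
  "Standard shipping takes 3-5 business days.",
];

export async function retrieveContext(query: string, docEmbeddings: number[][]): Promise<string> {
  // Each embed call becomes its own span when telemetry is enabled.
  const { embedding } = await embed({
    model: openai.embedding("text-embedding-3-small"), // illustrative model choice
    value: query,
    experimental_telemetry: { isEnabled: true, functionId: "embed-query" },
  });

  // Rank documents by cosine similarity and return the best match as context.
  const ranked = docs
    .map((doc, i) => ({ doc, score: cosineSimilarity(embedding, docEmbeddings[i]) }))
    .sort((a, b) => b.score - a.score);
  return ranked[0].doc;
}
```

The retrieved context then goes into the generation prompt, and that generation span shows exactly what the model saw.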
Grouping Operations with Parent Spans
A single user request might trigger multiple LLM calls, tool executions, and retrievals. Group all of these under one parent span, so every operation for a request is nested together. Click on the parent span to see the entire execution tree: classification, tool calls, retrieval, and generation, all in one view, with timing relationships visible at a glance. See the entire agent with grouped tracing here.
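A sketch of a manual parent span using the OpenTelemetry API directly; the tracer name, span name, and imported module path are illustrative:

```typescript
import { trace, SpanStatusCode } from "@opentelemetry/api";
import { handleSupportQuery } from "./support-bot"; // hypothetical module path

const tracer = trace.getTracer("support-agent");

export async function handleRequest(query: string): Promise<string> {
  // startActiveSpan makes this span the parent of every span created in the
  // callback, so the classifier, tool, and RAG spans all nest underneath it.
  return tracer.startActiveSpan("support-agent.handle-request", async (span) => {
    try {
      span.setAttribute("query", query);
      const answer = await handleSupportQuery(query);
      return answer;
    } catch (err) {
      span.setStatus({ code: SpanStatusCode.ERROR });
      throw err;
    } finally {
      span.end();
    }
  });
}
```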
Running the Demo
The final SupportBot combines the classifier, the order status tool, and the FAQ retrieval into a single agent. You can see the code here and run it with the command in the repository. Once the run finishes, head to Phoenix and focus on the traces.
Viewing Your Traces
Open your Phoenix Cloud space. You’ll see 7 support-agent traces, one for each query.
Click into any trace to see the full execution tree. Let’s look at two interesting cases:
Trace 1: “Can you help me with something random?”
Our support query classifier gave the following classification:
Trace 2: “What’s the status of order ORD-99999?”
Our support query classifier gave the following classification:
Summary
Congratulations! In this tutorial, you learned how to:
- Trace LLM calls - Capture inputs, outputs, tokens, and latency with experimental_telemetry: { isEnabled: true }
- Trace tool calls - See tool decisions, parameters, and responses as child spans
- Trace RAG pipelines - Monitor embeddings and see retrieved context in generation prompts
- Group with parent spans - Nest all operations for a request into one trace
- View and analyze traces - Debug agent behavior by exploring execution trees in Phoenix
Next Steps
You can see inside your application now: every LLM call, tool execution, and retrieval is visible. We spent some time manually analyzing traces, but how can we automate this analysis across thousands of traces? And how can we store it in Phoenix so we can build metrics that measure our application? In the next chapter, you’ll learn to:
- Annotate traces to mark quality issues
- Capture user feedback (thumbs up/down) and attach it to traces
- Run automated LLM-as-Judge evaluations to find patterns in what’s failing

