When building agents and LLM applications, it’s hard to see what’s actually going on under the hood. Even if you set up a comprehensive agent architecture with multiple prompts, descriptive tools, and data retrievals, you’re left answering questions like:
  • Why did the agent choose that tool instead of this one?
  • What context was actually passed to the LLM when it generated that response?
  • Where is all the latency coming from - is it the model, the retrieval, or something else?
  • The user got a wrong answer, but which step in the pipeline failed?
In this tutorial, with just a few additional lines of code, you’ll be able to monitor every LLM call, tool execution, and retrieval operation that powers your agents. You’ll learn how to debug, monitor, and analyze your agents more effectively and efficiently, transforming them from personal projects to production-ready applications.
Follow along with code: This guide has a companion TypeScript project with runnable examples. Find it here.

SupportBot

Our sample support agent for this tutorial:
  1. Classifies incoming queries (order status vs. FAQ)
  2. Routes to the appropriate handler:
    • Order Status: Use a tool to look up order information, then summarize for the customer
    • FAQ: Search a knowledge base with embeddings, then generate an answer using RAG
The issue is that our users are complaining. Responses are slow, answers are wrong, but we have no idea why. Our agent is a black box - we can see what the user asked, and see how the agent replied, but we don’t have visibility into the individual components of our agent that actually ran. Let’s set up tracing to gain visibility.
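To make that flow concrete, here is a minimal sketch of the top-level routing. The function names (classifyQuery, handleOrderStatus, handleFaq) are illustrative stand-ins, not the companion project's exact code; their real implementations are the LLM, tool, and retrieval calls traced in the sections that follow.
// Illustrative stand-ins; the real implementations are the LLM, tool,
// and retrieval calls shown in the sections below.
declare function classifyQuery(query: string): Promise<"order_status" | "faq">;
declare function handleOrderStatus(query: string): Promise<string>;
declare function handleFaq(query: string): Promise<string>;

async function supportBot(userQuery: string): Promise<string> {
  const category = await classifyQuery(userQuery); // LLM classification
  return category === "order_status"
    ? handleOrderStatus(userQuery) // tool lookup + LLM summary
    : handleFaq(userQuery);        // embedding search + RAG answer
}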

Setting Up Tracing

First, install the dependencies and configure OpenTelemetry to send traces to Phoenix.

Install Dependencies

npm install ai @ai-sdk/openai @arizeai/phoenix-otel @arizeai/openinference-vercel \
  @arizeai/openinference-semantic-conventions @opentelemetry/api \
  @opentelemetry/sdk-trace-node @opentelemetry/exporter-trace-otlp-proto \
  @opentelemetry/resources @opentelemetry/semantic-conventions zod

Set up Phoenix Cloud

In order to send traces to Phoenix, you must sign up for a free space and account. Follow these instructions to configure Phoenix Cloud, if you haven’t already.

Configure OpenTelemetry

Create an instrumentation.ts file that sends traces to Phoenix:
import { register } from "@arizeai/phoenix-otel";

// Register with Phoenix - this handles all the OpenTelemetry boilerplate
export const provider = register({
  projectName: "support-bot",
});
Import this file at the top of your application to enable tracing.
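For example, assuming your entry point is index.ts:
// index.ts - load instrumentation first, before anything that calls the AI SDK,
// so the tracer provider is registered when the first span is created.
import "./instrumentation";

// ...the rest of your application imports and code...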

Tracing LLM Calls

Every LLM call is a decision point. What prompt did the model receive? What did it output? How long did it take, and how many tokens did it use? Without tracing, you’re forced to build your own logging and debugging, and you still miss key data needed for full observability. With tracing, you get a complete record of every LLM interaction, including:
  • Input messages (system, user, and assistant)
  • LLM output
  • Model name and provider
  • Invocation parameters
  • Token counts
  • Latency
For SupportBot, tracing LLM calls gives us observability into the classification stage: we’ll see exactly how each support query was classified and what led to that classification. It also gives us observability into the final generation stage: how the final answer was produced, and what context went into the LLM call that generated it. The key to tracing AI SDK calls is one parameter: experimental_telemetry: { isEnabled: true }. Add this to any generateText or embed call:
import { generateText } from "ai";
import { openai } from "@ai-sdk/openai";

const result = await generateText({
  model: openai.chat("gpt-4o-mini"),
  system: "Classify the query as 'order_status' or 'faq'",
  prompt: userQuery,
  experimental_telemetry: { isEnabled: true },  // This enables tracing
});
In Phoenix, you’ll see a span for this call showing the system and user messages, the model’s output, the model name, token counts, and latency.
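The traces later in this tutorial show the classifier returning structured JSON with a category, a confidence level, and reasoning. One way to produce that shape - shown here as a sketch rather than the companion project's exact code - is generateObject with a Zod schema; it accepts the same experimental_telemetry flag:
import { generateObject } from "ai";
import { openai } from "@ai-sdk/openai";
import { z } from "zod";

// Hypothetical structured classifier; the field names mirror the
// classification JSON shown in the trace walkthrough below.
const { object: classification } = await generateObject({
  model: openai.chat("gpt-4o-mini"),
  schema: z.object({
    category: z.enum(["order_status", "faq"]),
    confidence: z.enum(["low", "medium", "high"]),
    reasoning: z.string(),
  }),
  system: "Classify the support query and explain your reasoning.",
  prompt: userQuery,
  experimental_telemetry: { isEnabled: true }, // traced just like generateText
});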

Tracing Tool Calls

Tools allow your agent to interact with databases, APIs, and external systems. To understand how your tools are performing, you need to answer questions like:
  • Did the LLM decide to call the right tool?
  • Did it extract the parameters correctly?
  • Did the tool return what you expected?
With tracing, you see the complete chain - the LLM’s decision, the exact parameters passed, and the tool’s response - without having to guess which step broke. When your LLM calls tools, those executions are automatically traced as child spans. Define your tools using the standard AI SDK configuration; tracing happens automatically when experimental_telemetry is enabled:
import { generateText, tool } from "ai";
import { openai } from "@ai-sdk/openai";
import { z } from "zod";

const result = await generateText({
  model: openai.chat("gpt-4o-mini"),
  prompt: userQuery,
  tools: {
    lookupOrderStatus: tool({
      description: "Look up order status by order ID",
      inputSchema: z.object({
        orderId: z.string(),
      }),
      execute: async ({ orderId }) => {
        // Your tool logic here
        return orderDatabase[orderId];
      },
    }),
  },
  maxSteps: 2,
  experimental_telemetry: { isEnabled: true },  // Tools are traced automatically
});
In Phoenix, you’ll see:
  1. LLM Span: Model decides to call lookupOrderStatus
  2. Tool Span: Shows the tool name, input (orderId), and output
  3. LLM Span: Model summarizes the result
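For reference, the orderDatabase used by the tool above could be as simple as an in-memory record keyed by order ID. This is hypothetical sample data; the companion project's store (and its error handling for unknown IDs, which you'll see in the traces below) may differ:
// Hypothetical in-memory order store for the tool sketch above.
const orderDatabase: Record<string, { status: string; estimatedDelivery: string }> = {
  "ORD-12345": { status: "shipped", estimatedDelivery: "3-5 business days" },
  "ORD-67890": { status: "processing", estimatedDelivery: "5-7 business days" },
};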

Tracing RAG Pipelines

RAG pipelines can fail in many places. The embedding might not capture the query’s intent, the retrieval might return irrelevant documents, or the LLM might misuse good context. When a user gets a bad answer, which step failed? With tracing, you can see the full pipeline, including which documents were retrieved, what context was injected into the prompt, and how the LLM used it. You can pinpoint exactly where things went wrong. For RAG, trace both the embedding calls and the generation call. Each embed call becomes its own span:
import { embed, generateText } from "ai";
import { openai } from "@ai-sdk/openai";

// Embed the user's query - traced automatically
const { embedding } = await embed({
  model: openai.embedding("text-embedding-ada-002"),
  value: userQuery,
  experimental_telemetry: { isEnabled: true },
});

// ... semantic search logic ...

// Generate with retrieved context - traced automatically
const { text } = await generateText({
  model: openai.chat("gpt-4o-mini"),
  system: `Answer using ONLY this context:\n\n${retrievedContext}`,
  prompt: userQuery,
  experimental_telemetry: { isEnabled: true },
});
In Phoenix, you’ll see the embedding span and the generation span. The generation span shows the retrieved context in the system prompt, so you can immediately see whether retrieval found the right documents.
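For completeness, the elided semantic-search step is typically a similarity ranking over pre-embedded documents. Here is a minimal sketch, assuming an in-memory knowledgeBase whose entries were embedded ahead of time with the same model (all names here are illustrative); it uses the cosineSimilarity helper exported by the AI SDK:
import { cosineSimilarity } from "ai";

// Illustrative knowledge base; each entry's embedding is precomputed with
// the same embedding model used for the query.
type KbEntry = { text: string; embedding: number[] };
declare const knowledgeBase: KbEntry[];

// `embedding` comes from the embed() call above. Rank FAQ entries by
// similarity to the query embedding and keep the top two.
const retrievedContext = knowledgeBase
  .map((entry) => ({ entry, score: cosineSimilarity(embedding, entry.embedding) }))
  .sort((a, b) => b.score - a.score)
  .slice(0, 2)
  .map(({ entry }) => entry.text)
  .join("\n\n");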

Grouping Operations with Parent Spans

A single user request might trigger multiple LLM calls, tool executions, and retrievals. Let’s group all of these under one parent span, so every operation for a single request is nested together. Clicking on the parent span then shows the entire execution tree - classification, tool calls, retrieval, and generation - in one view, with timing relationships visible at a glance.
See the entire agent with grouped tracing here.
To see all operations for a single request as one trace, wrap them in a parent span using the OpenTelemetry API:
import { trace, SpanStatusCode } from "@opentelemetry/api";

const tracer = trace.getTracer("support-agent");

async function handleSupportQuery(userQuery: string) {
  return tracer.startActiveSpan(
    "support-agent",
    { attributes: { "openinference.span.kind": "AGENT", "input.value": userQuery } },
    async (span) => {
      try {
        // All LLM calls, tool executions, and embeddings inside here
        // will appear as children of this span
        const result = await processQuery(userQuery);
        
        span.setAttribute("output.value", result);
        span.setStatus({ code: SpanStatusCode.OK });
        return result;
      } catch (error) {
        span.setStatus({ code: SpanStatusCode.ERROR });
        throw error;
      } finally {
        span.end();
      }
    }
  );
}
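Here, processQuery stands in for the classification, tool, and RAG calls from the earlier sections. A call site might look like this (hypothetical; the companion project loops over its test queries):
// Every span created inside handleSupportQuery nests under the "support-agent" parent.
const answer = await handleSupportQuery("Where is my order ORD-67890?");
console.log(answer);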

Running the Demo

The final SupportBot agent combines the classifier, the order status tool, and the FAQ retrieval into a single agent. You can see the code here, and run it with:
pnpm start
The tutorial code runs 7 test queries against the agent:
const queries = [
  "What's the status of order ORD-12345?",  // Order found
  "How can I get a refund?",                 // FAQ in knowledge base
  "Where is my order ORD-67890?",            // Order found
  "I forgot my password",                    // FAQ in knowledge base
  "What's the status of order ORD-99999?",   // Order NOT found
  "How do I upgrade to premium?",            // FAQ NOT in knowledge base
  "Can you help me with something?",         // Vague request
];
The code will prompt for feedback on each response - you can skip this for now (press s) and focus on the traces.

Viewing Your Traces

Open your Phoenix Cloud space. You’ll see 7 support-agent traces - one for each query. Click into any trace to see the full execution tree. Let’s look at two interesting cases:

Trace 1: “Can you help me with something?”

Our support query classifier gave the following classification:
{
  "category": "faq",
  "confidence": "low",
  "reasoning": "The query is vague and doesn't specify a relevant topic, but it suggests a need for assistance, placing it within a general support context."
}
Confidence: low is a huge red flag! It tells us that our support query classifier was unable to confidently classify the user’s query, which suggests the query may be out of scope for our agent. The last span shows the most relevant context retrieved:
Context:
Q: How do I reset my password?
A: Go to Settings > Security > Reset Password. You'll receive an email with a reset link that expires in 24 hours.

Q: How do I update my profile information?
A: Go to Account Settings > Profile. You can update your name, email, phone number, and address there.
This context is not relevant to the user’s question at all. Our traces have therefore shown us exactly why the final answer was:
I’d be happy to help, but I can only assist with questions related to account settings, passwords, and profile information. Let me know if you need help with those!

Trace 2: “What’s the status of order ORD-99999?”

Our support query classifier gave us the following classification:
{
  "category": "order_status",
  "confidence": "high",
  "reasoning": "The query directly asks about the status of a specific order, indicating it is related to order tracking."
}
Hmm. Our classifier thinks this question falls squarely within the scope of our agent. Let’s keep going. Our support agent’s LLM span chose the following tool call:
lookupOrderStatus("{\"orderId\":\"ORD-99999\"}")
Seems good… The lookupOrderStatus tool call returned:
{"error":"Order ORD-99999 not found in our system"}
Aha! Seems like ORD-99999 is an invalid order number! That’s why the final output was:
Hi there! I checked on your order with the ID ORD-99999, but it seems that I couldn't find any details at the moment. If you could provide me with more information or check back later, I'd be happy to assist you further!

Summary

Congratulations! In this tutorial, you learned how to:
  • Trace LLM calls - Capture inputs, outputs, tokens, and latency with experimental_telemetry: { isEnabled: true }
  • Trace tool calls - See tool decisions, parameters, and responses as child spans
  • Trace RAG pipelines - Monitor embeddings and see retrieved context in generation prompts
  • Group with parent spans - Nest all operations for a request into one trace
  • View and analyze traces - Debug agent behavior by exploring execution trees in Phoenix

Next Steps

You can see inside your application now - every LLM call, tool execution, and retrieval is visible. We spent some time manually analyzing traces. But how can we automate this analysis over thousands of traces? How can we store this analysis in Phoenix, so that we can build metrics that measure our application? In the next chapter, you’ll learn to:
  • Annotate traces to mark quality issues
  • Capture user feedback (thumbs up/down) and attach it to traces
  • Run automated LLM-as-Judge evaluations to find patterns in what’s failing