Your support agent handles single queries well. Classification works. Tool calls execute. RAG retrieves relevant documents. But real customer support isn’t just single queries, it’s full conversations: “What’s my order status?” → “When will it arrive?” → “Can I change the address?” Each of these is a separate trace. Without sessions, they’re disconnected points in your data. You can’t see that the customer asked about the same order three times, or that the agent forgot the order ID between turns and asked for it again.

Sessions change that. By grouping traces with a shared session ID, you transform isolated data points into conversation threads. In Phoenix, you can see the full back-and-forth, track metrics across the conversation (total tokens, turns to resolution), and debug issues like “the bot forgot what I said.”

In this chapter, you’ll add session tracking to your support agent, run multi-turn conversations, and evaluate conversations as complete units - not just individual turns.
Follow along with code: This guide has a companion TypeScript project with runnable examples. Find it here.

3.1 Setting Up Sessions

See the full session-enabled agent code here.
Adding session tracking to your agent is surprisingly simple. You need two things:
  1. A session ID: A unique identifier for each conversation (usually a UUID)
  2. Context propagation: Making sure child spans inherit the session ID
The key insight is that session IDs are just span attributes. Set them on your parent span, and Phoenix automatically groups all related traces together.

Install Dependencies

You’ll need the OpenInference core package to set session context:
npm install @arizeai/openinference-core

Add Session Tracking to Your Agent

Here’s how to modify your support agent to support sessions:
import { setSession } from "@arizeai/openinference-core";
import { context, trace } from "@opentelemetry/api";
import { SemanticConventions } from "@arizeai/openinference-semantic-conventions";

const tracer = trace.getTracer("support-agent");

async function handleSupportQuery(
  userQuery: string,
  sessionId?: string,
  conversationHistory: Message[] = [],
  sessionContext?: SessionContext
): Promise<AgentResponse> {
  const runAgent = async (): Promise<AgentResponse> => {
    return tracer.startActiveSpan(
      "support-agent",
      {
        attributes: {
          "openinference.span.kind": "AGENT",
          "input.value": userQuery,
          // Add session ID to the span
          ...(sessionId && { [SemanticConventions.SESSION_ID]: sessionId }),
          // Record the turn number so spans can be ordered later
          ...(sessionContext && {
            "conversation.turn": sessionContext.turnCount + 1,
          }),
        },
      },
      async (agentSpan) => {
        // ... agent logic ...
      }
    );
  };

  // Propagate session context to all child spans
  if (sessionId) {
    return context.with(
      setSession(context.active(), { sessionId }),
      runAgent
    );
  }
  
  return runAgent();
}
The key additions:
  1. SemanticConventions.SESSION_ID: The standard attribute name for session IDs
  2. setSession(): Propagates the session ID to all child spans
  3. context.with(): Ensures the session context is active during execution

Track Conversation History

For multi-turn conversations, you also need to track what’s been said. Here’s a simple message type:
interface Message {
  role: "user" | "assistant";
  content: string;
}

interface SessionContext {
  lastMentionedOrderId?: string;
  turnCount: number;
}
Between turns, append messages to the history and update any context the agent should remember (like order IDs the customer mentioned).
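One way to do this between turns is a small helper that appends both sides of the exchange and captures any order ID the customer mentioned. Here’s a minimal sketch - recordTurn is a hypothetical helper for illustration, not part of the companion project:
function recordTurn(
  history: Message[],
  sessionContext: SessionContext,
  userQuery: string,
  agentResponse: string
): void {
  // Append both sides of the exchange to the running history
  history.push(
    { role: "user", content: userQuery },
    { role: "assistant", content: agentResponse }
  );

  // Remember the most recently mentioned order ID so follow-ups like
  // "When will it arrive?" can be resolved without asking again
  const orderIdMatch = userQuery.match(/ORD-\d+/i);
  if (orderIdMatch) {
    sessionContext.lastMentionedOrderId = orderIdMatch[0].toUpperCase();
  }

  sessionContext.turnCount++;
}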

3.2 Running Multi-Turn Conversations

See the multi-turn conversation demo code here.
Now let’s see sessions in action. Here’s a conversation scenario that tests the agent’s ability to maintain context:
const sessionId = crypto.randomUUID();
const conversationHistory: Message[] = [];
const sessionContext: SessionContext = { turnCount: 0 };

// Turn 1: Ask about an order
const turn1 = await handleSupportQuery(
  "What's the status of order ORD-12345?",
  sessionId,
  conversationHistory,
  sessionContext
);

// Update history
conversationHistory.push(
  { role: "user", content: "What's the status of order ORD-12345?" },
  { role: "assistant", content: turn1.response }
);
sessionContext.lastMentionedOrderId = "ORD-12345";
sessionContext.turnCount++;

// Turn 2: Follow-up question (no order ID)
const turn2 = await handleSupportQuery(
  "When will it arrive?",
  sessionId,
  conversationHistory,
  sessionContext
);

// The agent should remember ORD-12345 from the previous turn
Run the sessions demo:
pnpm sessions
This runs three conversation scenarios:
  1. Order Inquiry: Customer asks about order, then asks follow-up questions
  2. FAQ Conversation: Multiple FAQ questions in one session
  3. Mixed Conversation: Switching between order and FAQ topics

What You’ll See in Phoenix

Now you can view and analyze your traces, grouped by user session!

3.3 Session-Level Evaluations

See the session evaluation code here.
You can now see full conversations in Phoenix, but manually reviewing every session doesn’t scale. With hundreds of conversations happening daily, you need automated insights. This is where LLM-as-Judge evaluation shines. Instead of clicking through sessions one by one, you can automatically evaluate entire conversations and answer questions like:
  • Is memory being preserved? Does the agent remember order IDs, customer preferences, and context from earlier in the conversation?
  • Are issues getting resolved? Do conversations end with the customer’s problem solved, or do they trail off unresolved?
  • Where do conversations break down? Which sessions show signs of confusion, repetition, or context loss?
By running evaluators across all your sessions, you get aggregate metrics (“85% of conversations maintain coherence”) and can quickly filter to the problematic ones. The evaluator also generates explanations, so you understand why a session was marked as incoherent or unresolved.

Conversation Coherence Evaluator

This evaluator checks if the agent maintained context throughout the conversation:
import { createClassificationEvaluator } from "@arizeai/phoenix-evals";
import { openai } from "@ai-sdk/openai";

const conversationCoherenceEvaluator = createClassificationEvaluator({
  name: "conversation_coherence",
  model: openai("gpt-5"),
  choices: {
    coherent: 1,
    incoherent: 0,
  },
  // Explanations are automatically generated by the evaluator
  promptTemplate: `You are evaluating whether a customer support agent maintained context throughout a multi-turn conversation.

A conversation is COHERENT if:
- The agent remembers information from earlier turns
- The agent doesn't ask for information already provided
- Responses build on previous context appropriately

A conversation is INCOHERENT if:
- The agent "forgets" things the customer said earlier
- The agent asks for the same information multiple times
- Responses seem disconnected from previous turns

[Full Conversation]:
{{input}}

Did the agent maintain context throughout this conversation?
`,
});
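Before pointing this at real Phoenix data, you can sanity-check the evaluator against a hand-written transcript. A minimal sketch - the transcript below is made up for illustration, and the full flow over Phoenix spans follows in the next section:
const sampleTranscript = [
  "Turn 1:\nUser: What's the status of order ORD-12345?\nAgent: ORD-12345 is processing and should ship this week.",
  "Turn 2:\nUser: When will it arrive?\nAgent: Which order are you asking about?",
].join("\n\n");

// The agent "forgot" the order ID, so we expect an incoherent label here
const sample = await conversationCoherenceEvaluator.evaluate({
  input: sampleTranscript,
});
console.log(sample.label, sample.score, sample.explanation);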

Resolution Evaluator

This evaluator determines if the customer’s issue was actually resolved:
const resolutionEvaluator = createClassificationEvaluator({
  name: "resolution_status",
  model: openai("gpt-5"),
  choices: {
    resolved: 1,
    unresolved: 0,
  },
  // Explanations are automatically generated by the evaluator
  promptTemplate: `You are evaluating whether a customer's issue was resolved in a support conversation.

The issue is RESOLVED if:
- The customer got the information they needed
- Their question was answered
- The conversation ended with the customer's needs met

The issue is UNRESOLVED if:
- The customer didn't get what they needed
- Questions went unanswered
- The agent couldn't help with the request

[Full Conversation]:
{{input}}

Was the customer's issue resolved?
`,
});

Running Session Evaluations

pnpm evaluate:sessions
Here’s the full evaluation flow. First, fetch spans from Phoenix and group them by session ID:
import { getSpans } from "@arizeai/phoenix-client/spans";
import { logSessionAnnotations } from "@arizeai/phoenix-client/sessions";
import { SemanticConventions } from "@arizeai/openinference-semantic-conventions";

// Fetch all agent spans
const { spans } = await getSpans({
  project: { projectName: "support-bot" },
  limit: 200,
});

// Filter to agent spans and group by session ID
const agentSpans = spans.filter((span) => span.name === "support-agent");

const sessionGroups = new Map<string, typeof agentSpans>();
for (const span of agentSpans) {
  const sessionId = span.attributes[SemanticConventions.SESSION_ID] as string;
  if (sessionId) {
    if (!sessionGroups.has(sessionId)) {
      sessionGroups.set(sessionId, []);
    }
    sessionGroups.get(sessionId)!.push(span);
  }
}

console.log(`Found ${sessionGroups.size} sessions`);
For each session, build a transcript and run the evaluators:
const sessionAnnotations = [];

for (const [sessionId, sessionSpans] of sessionGroups) {
  // Sort by turn number
  sessionSpans.sort((a, b) => {
    const turnA = (a.attributes["conversation.turn"] as number) || 0;
    const turnB = (b.attributes["conversation.turn"] as number) || 0;
    return turnA - turnB;
  });

  // Build conversation transcript
  const transcript = sessionSpans.map((span, i) => {
    const input = (span.attributes["input.value"] as string) || "";
    const output = (span.attributes["output.value"] as string) || "";
    return `Turn ${i + 1}:\nUser: ${input}\nAgent: ${output}`;
  }).join("\n\n");

  // Run coherence evaluator
  const coherenceResult = await conversationCoherenceEvaluator.evaluate({
    input: transcript,
  });

  // Run resolution evaluator  
  const resolutionResult = await resolutionEvaluator.evaluate({
    input: transcript,
  });

  // Collect annotations
  sessionAnnotations.push({
    sessionId,
    name: "conversation_coherence",
    label: coherenceResult.label ?? "unknown",
    score: coherenceResult.score ?? 0,
    explanation: coherenceResult.explanation,
    annotatorKind: "LLM" as const,
    metadata: { model: "gpt-5", turnCount: sessionSpans.length },
  });

  sessionAnnotations.push({
    sessionId,
    name: "resolution_status",
    label: resolutionResult.label ?? "unknown",
    score: resolutionResult.score ?? 0,
    explanation: resolutionResult.explanation,
    annotatorKind: "LLM" as const,
    metadata: { model: "gpt-5", turnCount: sessionSpans.length },
  });
}
Finally, log all session annotations to Phoenix:
await logSessionAnnotations({
  sessionAnnotations,
  sync: false,
});

console.log(`Logged ${sessionAnnotations.length} session-level annotations`);
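The annotations you just logged are also handy in memory. Here’s a minimal sketch (not from the companion project) that turns them into the aggregate numbers mentioned earlier and lists the sessions worth a closer look:
// Aggregate coherence rate across all evaluated sessions
const coherence = sessionAnnotations.filter(
  (a) => a.name === "conversation_coherence"
);
const coherentCount = coherence.filter((a) => a.score === 1).length;
const coherenceRate =
  coherence.length > 0
    ? Math.round((coherentCount / coherence.length) * 100)
    : 0;
console.log(`${coherenceRate}% of ${coherence.length} sessions maintain coherence`);

// Sessions flagged as incoherent or unresolved, with the judge's explanation
const problematic = sessionAnnotations.filter(
  (a) => a.label === "incoherent" || a.label === "unresolved"
);
for (const annotation of problematic) {
  console.log(
    `[${annotation.name}] session ${annotation.sessionId}: ${annotation.explanation}`
  );
}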

Viewing and Analyzing Session-Level Evals

Now that we’ve run our session-level evaluators, let’s see how the support bot performs across user sessions. Take the mixed conversation scenario as an example:

Turn 1: The user asks about order ORD-67890. The agent correctly looks up the order and reports it’s processing with a December 15 ETA.

Turn 2: The user switches topics entirely - “How do I cancel my subscription?” This is a FAQ question, not an order question. The agent handles it via RAG, providing the correct cancellation instructions.

Turn 3: Here’s the real test. The user says “Back to my order - what’s the carrier?” They don’t repeat the order ID; they just say “my order.” Did the agent remember? Yes. It correctly referenced ORD-67890 and provided the carrier status (pending) without asking the user to repeat themselves.

The session-level annotations confirm what we see:
  • conversation_coherence: coherent (score: 1.0) - The explanation notes that “the agent correctly referenced the order ID and consistent details across turns… and also handled the separate subscription question without losing track.”
  • resolution_status: resolved (score: 1.0) - The explanation confirms “the agent answered the user’s questions: provided order status and ETA, explained cancellation steps, and clarified that the carrier is currently pending.”
This is exactly what session evaluation gives you. Instead of manually reviewing each turn, you can scan the coherence and resolution scores across all sessions. When you find one marked “incoherent” or “unresolved,” click in to see the explanation and understand what went wrong.

Summary

You’ve used sessions to transform your tracing data from isolated queries into conversation threads. Here are the benefits you’ve gained by using sessions:
Without Sessions                  | With Sessions
Individual traces, disconnected   | Full conversation history
Can’t see context loss            | “Bot forgot what I said” is visible
Per-turn metrics only             | Total tokens, turns to resolution
Evaluate single responses         | Evaluate entire conversations
The workflow:
  1. Add session IDs to your agent (one-time setup)
  2. Track conversation history between turns
  3. View sessions in the Phoenix Sessions tab
  4. Evaluate conversations with coherence and resolution evaluators
  5. Debug patterns by clicking into problematic sessions

Congratulations!

This marks the end of the tracing tutorial. Over three chapters, you’ve learned how to gain observability into your LLM applications:
  • Chapter 1: Tracing every LLM call, tool execution, and retrieval
  • Chapter 2: Annotating traces with human feedback and LLM-as-Judge
  • Chapter 3: Tracking multi-turn conversations as sessions

Next Steps

From here, apply the patterns you’ve learned - tracing, annotation, evaluation, and sessions - to any LLM application you build. The specific evaluators and metrics will change, but the approach stays the same: observe everything, measure what matters, and use the data to improve.