Our support agent is running, and traces are flowing into Phoenix. We can see every LLM call, tool execution, and retrieval. Users are still complaining: some responses are helpful, others are completely wrong. We need a way to measure quality, not just observe activity. In this chapter, you'll learn to:
  • Annotate traces with human feedback. Labeling traces shows you where the agent needs to improve.
  • Capture user reactions from your application. When users complain, attach that feedback to your traces and use it to improve.
  • Run automated LLM-as-Judge evaluations to find patterns in what's failing. Scaling your analysis to thousands of traces with an LLM gives you a confident, data-driven picture of what needs to change.
Follow along with code: This guide has a companion TypeScript project with runnable examples. Find it here.

2.1 Human Annotations in the UI

Before automating anything, we need to know what “good” actually looks like. Is a one-sentence answer better than a detailed paragraph? Should the agent apologize when it can’t help? These depend on our users, our brand, and our use case. Human annotation is how we build that understanding. By manually reviewing traces and marking them as good, bad, or somewhere in between, we create ground truth - the gold standard that everything else gets measured against. We’ll also start noticing patterns: maybe the agent struggles with multi-part questions, or gets confused when users reference previous messages.

Create Annotation Config

Navigate to Settings → Annotations in Phoenix to create annotation types. We'll create a simple config for labeling our support agent's helpfulness. Here's a breakdown of the available annotation types.
| Type | Example | Use Case |
| --- | --- | --- |
| Categorical | correct / incorrect | Yes/no or multi-class labels |
| Continuous | 1-5 scale, 0-100% | Numeric scores |
| Freeform | Any text | Open-ended notes |

Annotate in the UI

Open a trace → click Annotate → fill out the form. Once we’ve annotated traces, we can filter by annotation values, export to datasets, and compare across annotators. Even 50 well-annotated traces teach you more about failure modes than weeks of guessing.

2.2 Programmatic Annotations (User Feedback)

Manual annotation gives us ground truth, but it doesn't scale. We can review maybe 50 traces a day, while the agent handles thousands of conversations. Fortunately, our users are already telling us what's working: every thumbs up, thumbs down, "this wasn't helpful" click, or escalation to a human agent is feedback. Let's store that feedback in Phoenix by attaching it to our traces. We'll simulate a thumbs up/thumbs down feature and record each rating as an annotation on the corresponding trace, giving us metrics on how satisfied our users are.

Get the Span ID from Running Code

To attach feedback to a trace, you need the span ID. Here’s how to capture it:
import { trace } from "@opentelemetry/api";

const tracer = trace.getTracer("support-agent");

async function handleSupportQuery(userQuery: string) {
  return tracer.startActiveSpan("support-agent", async (span) => {
    // Capture the span ID for later feedback
    const spanId = span.spanContext().spanId;

    // ... process query ...

    span.end();
    return {
      response: "Your order has shipped!",
      spanId, // Return this to your frontend
    };
  });
}
In a web application, you’d return the spanId to your frontend along with the response, then send it back when the user clicks thumbs up/down.
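For example, here's a minimal sketch of the browser side, assuming a hypothetical /api/feedback route on your backend (the route name and payload shape are illustrative, not part of the companion project):
// Frontend: send the user's reaction back to your own backend,
// along with the spanId returned by handleSupportQuery.
async function sendFeedback(spanId: string, thumbsUp: boolean): Promise<void> {
  await fetch("/api/feedback", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ spanId, feedback: thumbsUp ? "thumbs-up" : "thumbs-down" }),
  });
}
The next step is turning that request into a Phoenix annotation on the server.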

Log Feedback via Phoenix Client

Install the Phoenix client:
npm install @arizeai/phoenix-client
Then log annotations:
import { logSpanAnnotations } from "@arizeai/phoenix-client/spans";

// When user clicks thumbs up
await logSpanAnnotations({
  spanAnnotations: [{
    spanId: "abc123...",  // The span ID from your response
    name: "user_feedback",
    label: "thumbs-up",
    score: 1,
    annotatorKind: "HUMAN",
    metadata: {
      source: "web_app",
      userId: "user_456",
    },
  }],
  sync: true,
});
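Putting the pieces together, here's a minimal sketch of a backend route that receives the frontend's feedback and forwards it to Phoenix. Express and the /api/feedback route are assumptions for illustration; adapt it to your own framework:
import express from "express";
import { logSpanAnnotations } from "@arizeai/phoenix-client/spans";

const app = express();
app.use(express.json());

// Receives { spanId, feedback } from the frontend and logs it as a Phoenix annotation.
app.post("/api/feedback", async (req, res) => {
  const { spanId, feedback } = req.body as { spanId: string; feedback: "thumbs-up" | "thumbs-down" };

  await logSpanAnnotations({
    spanAnnotations: [{
      spanId,
      name: "user_feedback",
      label: feedback,
      score: feedback === "thumbs-up" ? 1 : 0,
      annotatorKind: "HUMAN",
      metadata: { source: "web_app" },
    }],
    sync: true,
  });

  res.status(204).end();
});

app.listen(3000);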
In the companion project, feedback is collected from the terminal rather than a web UI. Run the support agent; it will prompt you to rate 7 traces and push the annotations to Phoenix:
pnpm start
After the agent generates 7 responses, you’ll be prompted to rate each one:
  • Enter y for thumbs-up (good response)
  • Enter n for thumbs-down (bad response)
  • Enter s to skip
Your feedback is sent to Phoenix as annotations. Check the Annotations tab on each trace to see your ratings.

2.3 LLM-as-Judge Evaluations

We’ve collected user feedback and identified which responses were unhelpful. Now we need to understand why they failed. Was the tool call returning errors? Was the retrieval pulling irrelevant context? Instead of manually clicking through each unhelpful trace, you can automate this analysis. We’ll create two evaluators - one for our lookupOrderStatus tool, and the other for FAQ retrieval relevance. These evaluators annotate the child spans, so when you click into an unhelpful trace, you can immediately see what went wrong.

Install the Phoenix Evals Package

npm install @arizeai/phoenix-evals

Tool Result Evaluator

Did the tool call succeed or return an error? This is a simple code-based check:
// Filter for tool spans
const toolSpans = spans.filter((span) => span.name === "ai.toolCall");

for (const span of toolSpans) {
  const spanId = span.context.span_id;
  const output = JSON.stringify(span.attributes["output.value"] || "");

  // Simple check: does the output contain "error" or "not found"?
  const hasError = output.toLowerCase().includes("error") ||
                   output.toLowerCase().includes("not found");

  const status = hasError ? "❌ ERROR" : "✅ SUCCESS";
  console.log(`   Tool span ${spanId.substring(0, 8)}... ${status}`);

  annotations.push({
    spanId,
    name: "tool_result",
    label: hasError ? "error" : "success",
    score: hasError ? 0 : 1,
    explanation: hasError ? "Tool returned an error or 'not found' response" : "Tool executed successfully",
    annotatorKind: "LLM" as const, // Using "LLM" for consistency, though this is code-based
    metadata: {
      evaluator: "tool_result",
      type: "code",
    },
  });
}

Retrieval Relevance Evaluator

Was the retrieved context actually relevant to the question?
import { createClassificationEvaluator } from "@arizeai/phoenix-evals";
import { openai } from "@ai-sdk/openai";

// Filter for the LLM calls that use retrieved context
const llmSpans = spans.filter((span) => 
    span.name === "ai.generateText" && 
    String(span.attributes["gen_ai.system"] || "").includes("Answer the user's question using ONLY the information provided in the context below. Be friendly and concise.")
);

// Create an LLM-as-Judge evaluator that determines if retrieved context was relevant
const retrievalRelevanceEvaluator = createClassificationEvaluator({
  name: "retrieval_relevance",
  model: openai("gpt-4o-mini"),
  choices: {
    relevant: 1,
    irrelevant: 0,
  },
  promptTemplate: `You are evaluating whether the retrieved context is relevant to answering the user's prompt.

Classify the retrieval as:
- RELEVANT: The context contains information that directly helps answer the question
- IRRELEVANT: The context does NOT contain useful information for the question

You are comparing the "Context" object and the "prompt" object.

[Context and Prompt]: {{input}}
`,
});

// Evaluate each RAG span
for (const span of llmSpans) {
  const spanId = span.context.span_id;

  // Extract the system prompt (which contains the retrieved context)
  const input = (span.attributes["input.value"] as string) || "";

  const result = await retrievalRelevanceEvaluator.evaluate({ input });

  const status = result.label === "relevant" ? "✅ RELEVANT" : "❌ IRRELEVANT";
  console.log(`   RAG span ${spanId.substring(0, 8)}... ${status}`);

  // Add annotation to be logged to Phoenix
  annotations.push({
    spanId,
    name: "retrieval_relevance",
    label: result.label,
    score: result.score,
    explanation: result.explanation,
    annotatorKind: "LLM",
    metadata: {
      model: "gpt-4o-mini",
      evaluator: "retrieval_relevance",
    },
  });
}

Push Evaluations

// Step 4: Log annotations to Phoenix
await logSpanAnnotations({
  spanAnnotations: annotations,
  sync: false,  // async mode - Phoenix processes in background
});
console.log(`✅ Logged ${annotations.length} evaluation annotations`);
The full evaluation script in the tutorial handles both evaluators.

Run the Evaluation Script

The tutorial includes a complete evaluation script:
pnpm evaluate
This will:
  1. Fetch tool and RAG spans from Phoenix
  2. Evaluate each:
    • Tool spans: success vs. error (code-based check)
    • Retrieval spans: relevant vs. irrelevant (LLM-based)
  3. Log results back as annotations on the child spans
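If you'd rather assemble the pieces yourself, the overall shape of such a script looks roughly like this. The fetchSpansForProject helper and the project name are hypothetical placeholders for however you pull spans out of Phoenix; the two evaluator loops are the ones shown earlier in this section:
import { logSpanAnnotations } from "@arizeai/phoenix-client/spans";

// Hypothetical helper standing in for your span-fetching code.
declare function fetchSpansForProject(projectName: string): Promise<any[]>;

async function main() {
  const spans = await fetchSpansForProject("support-agent");

  // Shared array that both evaluator loops push their results into.
  const annotations: Array<{
    spanId: string;
    name: string;
    label: string;
    score: number;
    explanation?: string;
    annotatorKind: "LLM";
    metadata?: Record<string, string>;
  }> = [];

  // ...run the tool_result loop over the "ai.toolCall" spans...
  // ...run the retrieval_relevance loop over the RAG generateText spans...

  await logSpanAnnotations({
    spanAnnotations: annotations,
    sync: false, // async mode - Phoenix processes in the background
  });
  console.log(`✅ Logged ${annotations.length} evaluation annotations`);
}

main().catch(console.error);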

The Debugging Workflow

Now you have a complete debugging workflow:
  1. Run the agent (pnpm start) and provide feedback (thumbs up/down)
  2. Run evaluations (pnpm evaluate) to annotate child spans
  3. Click into unhelpful traces in Phoenix
  4. Check the child span annotations:
    • tool_result = error → The order wasn’t found
    • retrieval_relevance = irrelevant → The FAQ wasn’t in the knowledge base
This tells you exactly why a trace failed, not just that it failed. In this example, the agent gives an unhelpful answer about the user's order number, and a quick look at the tool span shows that order ORD-99999 simply isn't in the order database. Automated evals make it fast to pinpoint the root cause behind each annotation, because they can dig through trace and span data far faster than a human reviewer can.

Summary

Congratulations! You now have a complete quality feedback loop:
| Step | What You Do | What You Learn |
| --- | --- | --- |
| 1. User Feedback | Rate responses as helpful/unhelpful | Which traces failed |
| 2. Child Span Evals | Run pnpm evaluate | Why they failed (tool error? bad retrieval?) |
| 3. Analysis | Click into unhelpful traces | Root cause (missing order, FAQ not in KB) |
| 4. Fix | Update prompts, knowledge base, or tools | Improve the agent |
This is the debugging workflow that actually scales! Instead of manually reviewing every trace, you:
  • Use feedback to identify failures
  • Use automated evaluation to diagnose them
  • Use trace details to understand the root cause

Next Steps

Your traces are now annotated with both human feedback and automated evaluations. You can identify which responses failed and diagnose why. But there’s still a missing piece: real customer support isn’t just single queries, but full conversations between SupportBot and the customer. “What’s my order status?” followed by “When will it arrive?” followed by “Can I change the address?” In the next chapter, you’ll learn to track multi-turn conversations as sessions, giving you visibility into the full customer journey, not just isolated queries. Continue to Chapter 3: Sessions →