- Annotate traces with human feedback. This lets you label your traces and figure out where you need to improve.
- Capture user reactions from your application. When users complain, attach that feedback to your data and use it to improve.
- Run automated LLM-as-Judge evaluations to find patterns in what’s failing. Scale your analysis across thousands of traces with an LLM so you can make confident, data-driven decisions about what to improve.
Follow along with code: This guide has a companion TypeScript project with runnable examples. Find it here.
2.1 Human Annotations in the UI
Before automating anything, we need to know what “good” actually looks like. Is a one-sentence answer better than a detailed paragraph? Should the agent apologize when it can’t help? The answers depend on our users, our brand, and our use case. Human annotation is how we build that understanding. By manually reviewing traces and marking them as good, bad, or somewhere in between, we create ground truth - the gold standard that everything else gets measured against. We’ll also start noticing patterns: maybe the agent struggles with multi-part questions, or gets confused when users reference previous messages.
Create Annotation Config
Navigate to Settings → Annotations in Phoenix to create annotation types. We’ll create a simple config for labeling our support agent’s helpfulness. Here’s a breakdown of the different annotation configurations.

| Type | Example | Use Case |
|---|---|---|
| Categorical | correct / incorrect | Yes/no or multi-class labels |
| Continuous | 1-5 scale, 0-100% | Numeric scores |
| Freeform | Any text | Open-ended notes |
Annotate in the UI
Open a trace → click Annotate → fill out the form. Once we’ve annotated traces, we can filter by annotation values, export to datasets, and compare across annotators. Even 50 well-annotated traces teach you more about failure modes than weeks of guessing.
2.2 Programmatic Annotations (User Feedback)
Manual annotation gives you ground truth, but it doesn’t scale. We can review maybe 50 traces a day, while our agent is handling thousands of conversations. Sometimes, our users are already telling us what’s working: every thumbs up, thumbs down, “this wasn’t helpful” click, or escalation to a human agent is feedback. Let’s store that feedback in Phoenix so we can attach it to our traces! We’ll simulate a thumbs up/thumbs down feature and store those ratings as annotations on our traces in Phoenix. This gives us metrics on how satisfied our users are.
Get the Span ID from Running Code
To attach feedback to a trace, you need the span ID. Capture it in your handler while the agent is working, send the spanId to your frontend along with the response, then send it back when the user clicks thumbs up/down.
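Here’s a minimal sketch of capturing it, assuming the agent is instrumented with OpenTelemetry (which Phoenix tracing builds on); `handleSupportMessage` and `runSupportAgent` are hypothetical stand-ins for your own handler and agent call:

```typescript
import { trace } from "@opentelemetry/api";

// placeholder for your actual agent call
async function runSupportAgent(message: string): Promise<string> {
  return `You said: ${message}`;
}

export async function handleSupportMessage(message: string) {
  // inside an instrumented handler, the current span is available from the OTel context
  const spanId = trace.getActiveSpan()?.spanContext().spanId;

  const reply = await runSupportAgent(message);

  // return the spanId alongside the reply so the frontend can echo it back with the feedback
  return { reply, spanId };
}
```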
Log Feedback via Phoenix Client
Install the Phoenix client. The companion project then prompts for a rating on each response:
- Enter `y` for thumbs-up (good response)
- Enter `n` for thumbs-down (bad response)
- Enter `s` to skip
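Here’s a sketch of what logging that feedback could look like with the Phoenix client. The `POST /v1/span_annotations` call and its payload shape are assumptions about Phoenix’s REST API, and `user_feedback` is just an illustrative annotation name - adjust both to your Phoenix version:

```typescript
import { createClient } from "@arizeai/phoenix-client";

const phoenix = createClient({
  options: { baseUrl: process.env.PHOENIX_BASE_URL ?? "http://localhost:6006" },
});

// turn a thumbs up/down into a span annotation attached to the captured spanId
export async function logUserFeedback(spanId: string, thumbsUp: boolean) {
  await phoenix.POST("/v1/span_annotations", {
    body: {
      data: [
        {
          span_id: spanId,
          name: "user_feedback",   // annotation name shown in the Phoenix UI
          annotator_kind: "HUMAN", // this rating came from a person, not an LLM or code
          result: {
            label: thumbsUp ? "thumbs_up" : "thumbs_down",
            score: thumbsUp ? 1 : 0,
          },
        },
      ],
    },
  });
}
```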
2.3 LLM-as-Judge Evaluations
We’ve collected user feedback and identified which responses were unhelpful. Now we need to understand why they failed. Was the tool call returning errors? Was the retrieval pulling irrelevant context? Instead of manually clicking through each unhelpful trace, we can automate this analysis. We’ll create two evaluators - one for our `lookupOrderStatus` tool, and the other for FAQ retrieval relevance. These evaluators annotate the child spans, so when you click into an unhelpful trace, you can immediately see what went wrong.
Install the Phoenix Evals Package
Tool Result Evaluator
Did the tool call succeed or return an error? This is a simple code-based check, sketched below.
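Here’s a minimal sketch of that check. The tool output format (a JSON string with an optional `error` field) is an assumption - match it to whatever `lookupOrderStatus` actually returns:

```typescript
type ToolEval = { label: "success" | "error"; explanation: string };

// code-based evaluator: did the tool call return an error?
export function evaluateToolResult(toolOutput: string): ToolEval {
  try {
    const parsed = JSON.parse(toolOutput);
    if (parsed && typeof parsed === "object" && "error" in parsed) {
      return { label: "error", explanation: `Tool returned an error: ${parsed.error}` };
    }
    return { label: "success", explanation: "Tool returned a result with no error field" };
  } catch {
    // unparseable output counts as a failure too
    return { label: "error", explanation: "Tool output was not valid JSON" };
  }
}
```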
Retrieval Relevance Evaluator
Was the retrieved context actually relevant to the question? This one needs an LLM judge; a sketch follows.
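Below is a sketch of the idea using a plain OpenAI SDK call in place of the Phoenix Evals helpers; the prompt and model are illustrative, so treat this as one way to write the judge rather than the tutorial’s exact implementation:

```typescript
import OpenAI from "openai";

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

// LLM-as-judge: was the retrieved FAQ context relevant to the user's question?
export async function evaluateRetrievalRelevance(
  question: string,
  retrievedContext: string,
): Promise<{ label: "relevant" | "irrelevant"; explanation: string }> {
  const response = await openai.chat.completions.create({
    model: "gpt-4o-mini",
    messages: [
      {
        role: "user",
        content: [
          "You are judging a support agent's FAQ retrieval.",
          `Question: ${question}`,
          `Retrieved context: ${retrievedContext}`,
          'Answer with exactly one word, "relevant" or "irrelevant", depending on whether',
          "the retrieved context could help answer the question.",
        ].join("\n"),
      },
    ],
  });

  const verdict = response.choices[0]?.message?.content?.trim().toLowerCase() ?? "";
  // check for "irrelevant" first, since "relevant" is a substring of it
  const label = verdict.includes("irrelevant") ? "irrelevant" : "relevant";
  return { label, explanation: `Judge verdict: ${verdict}` };
}
```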
Push Evaluations
Run the Evaluation Script
The tutorial includes a complete evaluation script (a condensed sketch appears after this list):
- Fetch tool and RAG spans from Phoenix
- Evaluate each:
  - Tool spans: success vs. error (code-based check)
  - Retrieval spans: relevant vs. irrelevant (LLM-based)
- Log results back as annotations on the child spans
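As a rough sketch of how those steps fit together (the span-fetching and annotation helpers are hypothetical placeholders, and the module paths are illustrative - wire them up to your own Phoenix queries and the annotation call shown earlier):

```typescript
import { evaluateToolResult } from "./toolResultEvaluator";                 // placeholder paths
import { evaluateRetrievalRelevance } from "./retrievalRelevanceEvaluator";

// placeholders: however you fetch child spans from Phoenix for your project
declare function fetchToolSpans(toolName: string): Promise<{ id: string; output: string }[]>;
declare function fetchRetrievalSpans(): Promise<{ id: string; question: string; context: string }[]>;
// placeholder: the span-annotation call from the feedback section, generalized
declare function logAnnotation(
  spanId: string,
  name: string,
  annotatorKind: "CODE" | "LLM",
  label: string,
  explanation: string,
): Promise<void>;

async function main() {
  // 1. code-based check on lookupOrderStatus tool spans
  for (const span of await fetchToolSpans("lookupOrderStatus")) {
    const { label, explanation } = evaluateToolResult(span.output);
    await logAnnotation(span.id, "tool_result", "CODE", label, explanation);
  }

  // 2. LLM judge on FAQ retrieval spans
  for (const span of await fetchRetrievalSpans()) {
    const { label, explanation } = await evaluateRetrievalRelevance(span.question, span.context);
    await logAnnotation(span.id, "retrieval_relevance", "LLM", label, explanation);
  }
}

main().catch((err) => {
  console.error(err);
  process.exit(1);
});
```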
The Debugging Workflow
Now you have a complete debugging workflow:
- Run the agent (`pnpm start`) and provide feedback (thumbs up/down)
- Run evaluations (`pnpm evaluate`) to annotate child spans
- Click into unhelpful traces in Phoenix
- Check the child span annotations:
  - `tool_result = error` → The order wasn’t found
  - `retrieval_relevance = irrelevant` → The FAQ wasn’t in the knowledge base
Summary
Congratulations! You now have a complete quality feedback loop:

| Step | What You Do | What You Learn |
|---|---|---|
| 1. User Feedback | Rate responses as helpful/unhelpful | Which traces failed |
| 2. Child Span Evals | Run `pnpm evaluate` | Why they failed (tool error? bad retrieval?) |
| 3. Analysis | Click into unhelpful traces | Root cause (missing order, FAQ not in KB) |
| 4. Fix | Update prompts, knowledge base, or tools | Improve the agent |
- Use feedback to identify failures
- Use automated evaluation to diagnose them
- Use trace details to understand the root cause

