Notebook Walkthrough
Access the notebook and agent here: https://github.com/s-yeddula/phoenix-align-evals-ts/tree/main
Creating a dataset
Grab the Mastra agent traces from Phoenix and format them into dataset examples. In this example, we'll extract the user query, the tool calls, and the agent's final response. Once formatted, we'll upload the dataset back into Phoenix for evaluation.

Upload dataset to Phoenix
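The trace-to-example mapping can be sketched as a small helper. The trace fields below (`userQuery`, `toolCalls`, `finalResponse`) are assumptions about what you extract from each Phoenix trace, not an official schema:

```typescript
// Shape one Mastra agent trace into one Phoenix dataset example.
// Field names here are illustrative assumptions about your traces.

interface AgentTrace {
  userQuery: string;
  toolCalls: { name: string; args: unknown; output: unknown }[];
  finalResponse: string;
}

interface DatasetExample {
  input: { query: string };
  output: { response: string };
  metadata: { tool_calls: string };
}

function toDatasetExample(trace: AgentTrace): DatasetExample {
  return {
    input: { query: trace.userQuery },
    output: { response: trace.finalResponse },
    // Serialize tool calls so the judge can later see what the agent did.
    metadata: { tool_calls: JSON.stringify(trace.toolCalls) },
  };
}
```

The resulting array of examples is what gets uploaded to Phoenix as a dataset.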
Annotate dataset examples
Next, we need human annotations to serve as ground truth for evaluation. To do this, we'll add an annotation field in the metadata of each dataset example. This way, every example includes a reference label that our evaluator outputs can be compared against.
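A minimal sketch of that annotation step, assuming a simple example shape (the `annotation` key name is this tutorial's convention, not a required field):

```typescript
// Attach a human ground-truth label to an example's metadata.
type AlignmentLabel = "aligned" | "partially_aligned" | "misaligned";

interface Example {
  input: { query: string };
  output: { response: string };
  metadata: Record<string, string>;
}

function annotate(example: Example, label: AlignmentLabel): Example {
  // Return a copy so the original example is left untouched.
  return {
    ...example,
    metadata: { ...example.metadata, annotation: label },
  };
}
```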
In this example, we’ll evaluate how well the agent’s final response aligns with the tool calls and their outputs. We’ll use three labels for evaluation: aligned, partially_aligned, and misaligned.
You can adapt this setup to other evaluation criteria as needed.
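Since the judge returns free-form text, it helps to normalize its output onto the three labels before comparing against the annotations. A sketch, with the normalization rule as an assumption:

```typescript
// Normalize a judge's free-form output to one of the three labels,
// or null if it doesn't match any of them.
const LABELS = ["aligned", "partially_aligned", "misaligned"] as const;
type AlignmentLabel = (typeof LABELS)[number];

function normalizeLabel(raw: string): AlignmentLabel | null {
  const cleaned = raw.trim().toLowerCase().replace(/\s+/g, "_");
  return (LABELS as readonly string[]).includes(cleaned)
    ? (cleaned as AlignmentLabel)
    : null;
}
```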

LLM Judge Improvement Cycle
Now we'll start with a basic evaluation prompt and improve it iteratively. The workflow looks like this:

1. Run the evaluator.
2. Inspect the outputs and experiment results.
3. Update the evaluation prompt based on what's lacking.
4. Repeat until performance improves.

We'll use Phoenix experiments to identify weaknesses in the evaluator, review explanations, and track performance changes over time. In this tutorial, we'll go through two improvement cycles, but you can extend this process with more iterations to fine-tune the evaluator further.

Write baseline LLM judge prompt
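A baseline judge prompt might look like the sketch below. The wording and template variables are illustrative, not the notebook's exact prompt:

```typescript
// A deliberately simple first-pass judge prompt; later cycles refine it.
const BASELINE_JUDGE_PROMPT = `
You are evaluating whether an agent's final response is supported by the
tool calls it made and their outputs.

[User Query]: {query}
[Tool Calls and Outputs]: {tool_calls}
[Final Response]: {response}

Respond with exactly one label: aligned, partially_aligned, or misaligned.
`;

// Fill {name} placeholders from a variables map; unknown names are kept.
function fillPrompt(template: string, vars: Record<string, string>): string {
  return template.replace(/\{(\w+)\}/g, (_, k) => vars[k] ?? `{${k}}`);
}
```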
Define experiment task and evaluator
Run experiment
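The task and evaluator can be sketched as plain functions. In the notebook these are handed to Phoenix's experiment runner; here `callJudge` is a hypothetical stand-in for your LLM call so the shapes are clear, and the example shape mirrors the dataset built earlier:

```typescript
type Label = "aligned" | "partially_aligned" | "misaligned";

interface ExampleRow {
  input: { query: string };
  output: { response: string };
  metadata: { tool_calls: string; annotation: Label };
}

// Map the judge's raw text onto one of the three labels.
function parseLabel(raw: string): Label {
  const t = raw.toLowerCase();
  if (t.includes("partially")) return "partially_aligned";
  if (t.includes("misaligned")) return "misaligned";
  return "aligned";
}

// Task: build the judge prompt for one example and return its label.
// `callJudge` is a hypothetical LLM call you supply.
async function judgeTask(
  example: ExampleRow,
  callJudge: (prompt: string) => Promise<string>,
): Promise<Label> {
  const prompt =
    `Query: ${example.input.query}\n` +
    `Tool calls and outputs: ${example.metadata.tool_calls}\n` +
    `Final response: ${example.output.response}\n` +
    `Answer with one label: aligned, partially_aligned, or misaligned.`;
  return parseLabel(await callJudge(prompt));
}

// Evaluator: 1 if the judge matched the human annotation, else 0.
function matchesAnnotation(predicted: Label, example: ExampleRow): number {
  return predicted === example.metadata.annotation ? 1 : 0;
}
```

Phoenix aggregates these per-example scores across the dataset, which is what you compare between runs.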
Make refinements
After observing results in Phoenix, you can make targeted improvements to your evaluation prompt, then re-run the experiment to measure the change.

View progress in Phoenix
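As one illustration of the kind of refinement you might make and then track across runs: a second-iteration prompt could add explicit label definitions and request a JSON explanation you can review alongside the scores. The wording below is illustrative, not the tutorial's actual refined prompt:

```typescript
// A refined second-iteration judge prompt: adds label definitions and
// asks for structured output so explanations can be inspected in Phoenix.
const REFINED_JUDGE_PROMPT = `
You are judging whether an agent's final response is supported by its tool
calls and their outputs.

- aligned: every claim in the response is backed by a tool output.
- partially_aligned: some claims are backed, others are unsupported.
- misaligned: the response contradicts or ignores the tool outputs.

[User Query]: {query}
[Tool Calls and Outputs]: {tool_calls}
[Final Response]: {response}

Respond with JSON: {"label": "<one of the three labels>", "explanation": "<one sentence>"}
`;

// Parse the judge's JSON reply so the explanation can be logged too.
function parseJudgeReply(raw: string): { label: string; explanation: string } {
  const parsed = JSON.parse(raw) as { label?: unknown; explanation?: unknown };
  return {
    label: String(parsed.label ?? ""),
    explanation: String(parsed.explanation ?? ""),
  };
}
```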


