Follow Along with the Complete Python Notebook
LLM as a Judge Evaluators
LLM as a Judge evaluators use an LLM to assess output quality. These are particularly useful when correctness is subjective or hard to encode with rules, such as evaluating relevance, helpfulness, reasoning quality, or actionability. These evaluators use criteria you define, making them suitable for datasets with or without reference outputs.
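As a concrete illustration, an LLM as a Judge evaluator for a single criterion can be written as a plain Python function that calls a judge model and returns a score, a form Phoenix accepts as an experiment evaluator. The prompt wording, the gpt-4o-mini judge model, and the helpfulness_evaluator name below are illustrative choices, not part of the notebook:

```python
from openai import OpenAI

client = OpenAI()

# Illustrative judge prompt for a single criterion (helpfulness / actionability).
JUDGE_PROMPT = """You are judging a support agent's response.

User query: {query}
Agent response: {response}

Is the response helpful and actionable? Answer only "yes" or "no"."""


def helpfulness_evaluator(input: dict, output: str) -> float:
    """LLM as a Judge evaluator: returns 1.0 if the judge deems the response helpful."""
    prompt = JUDGE_PROMPT.format(query=input["query"], response=output)
    result = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative judge model choice
        messages=[{"role": "user", "content": prompt}],
    )
    verdict = result.choices[0].message.content.strip().lower()
    return 1.0 if verdict.startswith("yes") else 0.0
```

Because the criterion lives in the prompt rather than in code, the same pattern works whether or not your dataset includes reference outputs.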
LLM as a Judge Evaluator for Overall Agent Performance
This experiment evaluates the overall performance of the support agent using an LLM as a Judge evaluator. This allows us to assess subjective qualities like actionability and helpfulness that are difficult to measure with code-based evaluators.
Define the Task Function
The task function is what Phoenix calls for each example in your dataset. It receives the input from the dataset (in our case, the query field) and returns an output that will be evaluated.
In this example, our task function extracts the query from the dataset input, runs the full support agent (which includes tool calls and reasoning), and returns the agent’s response:
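A minimal sketch of such a task function, assuming the support agent built earlier in the notebook is exposed as a callable (named run_support_agent here purely for illustration):

```python
def task(input: dict) -> str:
    # Phoenix passes each dataset example's input to the task function.
    query = input["query"]

    # Run the full support agent (tool calls + reasoning). `run_support_agent`
    # is a stand-in for the agent entry point defined earlier in the notebook.
    response = run_support_agent(query)

    # The returned value is recorded as the experiment output and evaluated.
    return response
```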

