Follow along with the complete Python notebook.
Parts of an Experiment
An experiment consists of four main components: a task function, evaluators, a dataset, and the experiment run itself.

The task function represents the work your application performs for a single dataset example. You can pass the input (or other fields) from your dataset into this function, which produces an output for evaluation.

The task can be as simple or complex as your real system: it might call a single LLM, invoke a multi-step pipeline, retrieve context, or execute tool calls. What matters is that the task mirrors how your application behaves in practice so that experiment results reflect real-world performance.

By defining the task once and running it across all dataset examples, you ensure that every version of your system is evaluated under consistent conditions.
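As a minimal sketch, a task function can be a single LLM call that receives the example's input. The model choice and the `question` input field below are illustrative assumptions, not taken from the reference notebook:

```python
from openai import OpenAI

client = OpenAI()


def answer_question_task(input):
    """Run the application logic for a single dataset example.

    Phoenix passes the example's input fields to this function; the
    return value becomes the output that evaluators score.
    """
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": input["question"]}],
    )
    return response.choices[0].message.content
```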
Evaluators determine how Phoenix measures the quality of each task output. They take the task output and optionally compare it against a reference or expected output from the dataset.

Code-based evaluators are deterministic Python functions, useful when you have a clear, programmatic definition of correctness, such as exact match or comparing outputs against ground truth labels. Alternatively, LLM as a Judge evaluators use an LLM to assess output quality and are useful for subjective quality assessments or when you don’t have ground truth.

You can use one or multiple evaluators in the same experiment to capture different dimensions of quality.
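For example, a code-based evaluator can be an ordinary Python function whose parameter names tell Phoenix what to pass in: `output` is the task's return value and `expected` is the example's expected output. The `answer` key and exact-match logic here are illustrative:

```python
def exact_match(output, expected):
    """Code-based evaluator: deterministic comparison against ground truth.

    Returning a bool (or a float score) is enough for Phoenix to record
    an evaluation result for each example.
    """
    return output == expected.get("answer")
```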
Next, you connect the experiment to your dataset. You can run an experiment over the entire dataset or select a specific subset. This configuration step defines the exact scope and rigor of your experiment.
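A sketch of that step, assuming a dataset has already been uploaded to Phoenix under the name "support-tickets" (the name is a placeholder):

```python
import phoenix as px

# Connect to the running Phoenix instance and pull the dataset by name
phoenix_client = px.Client()
dataset = phoenix_client.get_dataset(name="support-tickets")
```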
Once the task, evaluators, dataset, and configuration are defined, you can run the experiment. Phoenix executes the task across all selected examples, applies evaluators to each output, and stores the results.

Each experiment run is tracked with its configuration, outputs, and evaluation scores, making it easy to compare experiments over time. This allows you to answer questions like whether a prompt change improved accuracy, whether a model swap introduced regressions, or how performance differs across different datasets.
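Putting the pieces together, a run might look like the following, using the task and evaluator sketches above (the experiment name is a placeholder):

```python
from phoenix.experiments import run_experiment

# Execute the task over every example in the dataset and apply the evaluators
experiment = run_experiment(
    dataset=dataset,
    task=answer_question_task,
    evaluators=[exact_match],
    experiment_name="prompt-v2",
)
```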
Why Use Experiments Instead of Regular Evaluations?
While you can run individual evaluations on traces or ad-hoc examples, experiments provide a structured, repeatable framework for systematic evaluation. Experiments offer:
- Consistency: Experiments ensure every version of your system is evaluated under identical conditions, using the same dataset and task. This eliminates variability from manual testing or one-off evaluations.
- Comparability: Experiments track configuration, outputs, and scores over time, making it easy to compare different versions of your agent. You can answer questions like “Did my prompt change improve accuracy?” or “Did switching models introduce regressions?”
- Systematic Coverage: Experiments run your task function across all dataset examples, ensuring comprehensive evaluation rather than spot-checking a few cases.
Code-Based Evaluator for Tool Call Accuracy
This experiment evaluates the accuracy of the agent’s `classify_ticket` tool by comparing its output against ground truth labels in the dataset. Since we have a golden dataset with ground truth available, we can use a code-based evaluator for fast, deterministic evaluation.
Define the Task Function
In this example, the task function takes an example from the dataset, extracts the user’s query, and passes it to the `classify_ticket` tool. The tool invokes the underlying `classify_ticket` function, which makes an LLM call to determine the ticket’s category: billing, technical, account, or other. The task function then returns this classification as its final output.
This task function will be used for all dataset examples.
For the full implementation of the `classify_ticket_task` function, see the reference notebook.
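As a rough outline of its shape (the `query` input key and the call signature of `classify_ticket` are assumptions; the notebook holds the real implementation):

```python
def classify_ticket_task(input):
    """Task for a single dataset example: return the agent's ticket category."""
    # The input key name is illustrative; use whatever key your dataset defines
    query = input["query"]
    # classify_ticket is the tool described above (see the reference notebook);
    # it makes an LLM call and returns one of: billing, technical, account, other
    return classify_ticket(query)
```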
Define the Code-Based Evaluator
Since our dataset has ground truth available in the `expected_category` field, we can use a code-based evaluator to check whether the task output matches what we expect. This evaluator compares the tool’s output against the reference output from the dataset:
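A sketch of such an evaluator, assuming the expected output is stored under the `expected_category` key (see the reference notebook for the exact version used in the experiment):

```python
def tool_call_accuracy(output, expected):
    """Return True when the predicted category matches the ground truth label."""
    predicted = str(output).strip().lower()
    ground_truth = str(expected.get("expected_category", "")).strip().lower()
    return predicted == ground_truth
```

Passing this evaluator to `run_experiment` alongside `classify_ticket_task` scores every example in the dataset.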

