Running Experiments

Phoenix supports two workflows for experiments: a UI-driven flow in the Playground and a programmatic SDK flow.

Run Experiments in the UI

Configure prompts and evaluators in the Playground and compare results.

Run Experiments with the SDK

Define tasks and evaluators in code and run experiments programmatically.

SDK Experiment Steps

Upload a Dataset

Load your test cases into Phoenix to use as inputs for experiments.

Create a Task

Define the function or workflow you want to evaluate against your dataset.

Configure Evaluators

Set up the scoring criteria to assess your task outputs.

Run an Experiment

Execute your task across all dataset examples and collect evaluation results.
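Conceptually, running an experiment means applying your task to every dataset example and scoring each output with your evaluators. The loop below is a minimal plain-Python sketch of that idea; the dataset, `task`, and `exact_match` names are illustrative stand-ins, not Phoenix SDK calls.

```python
# Minimal sketch of the experiment loop: run a task over dataset
# examples, then score each output with an evaluator.
# All names here are illustrative; the Phoenix SDK manages this for you.

dataset = [
    {"input": "2 + 2", "expected": "4"},
    {"input": "3 * 3", "expected": "9"},
]

def task(example):
    # Stand-in for the LLM app or function under test.
    return str(eval(example["input"]))

def exact_match(output, expected):
    # Simple heuristic evaluator: 1.0 if the output matches exactly.
    return 1.0 if output == expected else 0.0

results = []
for example in dataset:
    output = task(example)
    score = exact_match(output, example["expected"])
    results.append({"input": example["input"], "output": output, "score": score})

mean_score = sum(r["score"] for r in results) / len(results)
print(mean_score)  # 1.0
```

In Phoenix, the per-example outputs and scores are recorded against the dataset so runs can be compared side by side.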

Use Repetitions

Run tasks multiple times to measure variance and consistency.
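Because LLM tasks are often nondeterministic, a single run can be misleading. A sketch of what repetitions buy you, using hand-picked illustrative scores rather than real task output:

```python
import statistics

# Illustrative scores from running the same task 5 times on one example.
# A nondeterministic task can pass some repetitions and fail others;
# repetitions let you quantify that spread instead of trusting one run.
scores = [1.0, 1.0, 0.0, 1.0, 1.0]

mean = statistics.mean(scores)       # central tendency across repetitions
spread = statistics.pstdev(scores)   # population std dev: consistency
print(mean, spread)  # 0.8 0.4
```

A high mean with a high spread signals a flaky task; repetitions surface that distinction.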

Dataset Splits

Run experiments on specific subsets of your dataset.

Using Evaluators

LLM Evaluators

Use LLM-as-a-judge to assess quality, correctness, and other criteria.
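The LLM-as-a-judge pattern has two deterministic halves around the model call: building the judge prompt and mapping the judge's text label to a score. A sketch of those halves without a live model (the template and labels are illustrative, not a Phoenix API):

```python
# Sketch of the LLM-as-a-judge pattern, minus the actual model call.
# The prompt template and label set here are illustrative assumptions.

JUDGE_TEMPLATE = (
    "You are grading an answer.\n"
    "Question: {question}\n"
    "Answer: {answer}\n"
    "Reply with exactly one word: correct or incorrect."
)

def build_judge_prompt(question: str, answer: str) -> str:
    return JUDGE_TEMPLATE.format(question=question, answer=answer)

def parse_judge_label(raw: str) -> float:
    # Normalize the judge LLM's reply into a numeric score.
    return 1.0 if raw.strip().lower() == "correct" else 0.0

prompt = build_judge_prompt("What is 2 + 2?", "4")
score = parse_judge_label("Correct")  # a judge model would return this label
print(score)  # 1.0
```

In practice the prompt is sent to a judge model and the parsed label is recorded as the evaluation score.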

Code Evaluators

Use built-in heuristic evaluators such as exact match, JSON distance, and regex matching.
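To make the heuristics concrete, here are plain-Python sketches of two of them; Phoenix ships its own implementations, so these only show the underlying idea:

```python
import re

# Plain-Python sketches of common heuristic evaluators.
# These mirror the idea, not Phoenix's actual implementations.

def exact_match(output: str, expected: str) -> float:
    # 1.0 only when the output equals the expected string exactly.
    return 1.0 if output == expected else 0.0

def matches_regex(output: str, pattern: str) -> float:
    # 1.0 when the pattern occurs anywhere in the output.
    return 1.0 if re.search(pattern, output) else 0.0

print(exact_match("hello", "hello"))          # 1.0
print(matches_regex("order #1234", r"#\d+"))  # 1.0
```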

Custom Evaluators

Build your own evaluation logic with custom prompts or code.
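A custom code evaluator is, at its core, just a function that takes the task output (and optionally expected values) and returns a score. A self-contained sketch, where the function name and the score-plus-explanation return shape are illustrative choices:

```python
# Sketch of a custom code evaluator: a plain function over the task
# output. The name and return shape here are illustrative assumptions.

def keyword_coverage(output: str, expected_keywords: list[str]) -> dict:
    # Score the fraction of expected keywords present in the output.
    text = output.lower()
    hits = [kw for kw in expected_keywords if kw.lower() in text]
    score = len(hits) / len(expected_keywords) if expected_keywords else 0.0
    return {"score": score, "explanation": f"matched {hits}"}

result = keyword_coverage("Phoenix traces LLM apps", ["phoenix", "traces", "spans"])
print(result["score"])
```

Returning an explanation alongside the score makes results easier to audit when reviewing experiment runs.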

Dataset Evaluators

Attach evaluators to datasets for automatic scoring during experiments.