- Run our current prompt across a dataset of inputs
- Compute metrics to measure prompt performance
- Generate natural-language feedback to guide improvements
- Edit and retest prompts to build confidence in our changes
- Save and manage our prompt versions in Prompt Hub
Follow along with code: This guide has a companion notebook with runnable code examples. Find it here, and go to Part 2: Test Prompts at Scale.
Step 1: Load Dataset of Inputs
Let’s upload a dataset of support queries and run our new classification prompt against all of them. This lets us measure performance systematically before deploying to production. You can do this in the UI, in Python, or in TypeScript; the UI steps are:
- Download support_queries.csv here.
- Navigate to Datasets and Experiments, click Create Dataset, and upload the file.
- Select query for Input keys, as this is our input column.
- Select ground_truth for Output keys, as this is our ground truth output.
- Click Create Dataset and navigate to your new dataset.
Upload dataset to Phoenix
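For the code route, here is a minimal Python sketch using the Phoenix client. It assumes support_queries.csv has been downloaded locally and that a Phoenix instance is running and reachable; the dataset name is just an example, not the notebook's exact setup.

```python
import pandas as pd
import phoenix as px

# Load the downloaded CSV of support queries and ground-truth labels.
df = pd.read_csv("support_queries.csv")

# Upload it as a Phoenix dataset, mapping the input and output columns.
client = px.Client()
dataset = client.upload_dataset(
    dataset_name="support-queries",  # example name
    dataframe=df,
    input_keys=["query"],            # our input column
    output_keys=["ground_truth"],    # our ground-truth column
)
```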
Step 2: Run Experiment with Our Current Prompt
With a dataset in place, the next step is to measure how our prompt performs across many examples. This gives us a clear baseline for accuracy and helps surface the common failure patterns we’ll address next.
Define Task Function
The task function specifies how to generate output for every input in the dataset. For us, we generate output by asking our LLM to classify a support query.
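A minimal Python sketch of such a task function is below. The OpenAI model and the CLASSIFIER_PROMPT string (standing in for the classification prompt from Part 1) are placeholder assumptions; Phoenix binds the task's parameters by name, so the input dict here carries the example's query field.

```python
from openai import OpenAI

openai_client = OpenAI()

# Assumed placeholder for the classification prompt built earlier in the guide.
CLASSIFIER_PROMPT = "Classify the customer support query into exactly one category ..."

def classify_query(input):
    # Pull the support query out of the example's input and ask the LLM to label it.
    response = openai_client.chat.completions.create(
        model="gpt-4o-mini",  # example model
        messages=[
            {"role": "system", "content": CLASSIFIER_PROMPT},
            {"role": "user", "content": input["query"]},
        ],
    )
    return response.choices[0].message.content.strip()
```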
Define Evaluators
Running the model gives us raw predictions, but that alone doesn’t tell us much. Evaluators help turn those predictions into meaningful feedback by scoring performance and explaining why the model was right or wrong. This gives us a clearer picture of how our prompt is actually performing. In this example, we’ll use two evaluators:
- ground_truth_evaluator – Verifies whether the model’s predicted classification matches the ground truth.
- output_evaluator – Uses an LLM to provide a richer, qualitative analysis of each classification, including:
  - explanation – Why the classification is correct or incorrect.
  - confusion_reason – If incorrect, why the model might have made the wrong choice.
  - error_type – If incorrect, what kind of error occurred (broad_vs_specific, keyword_bias, multi_intent_confusion, ambiguous_query, off_topic, paraphrase_gap, or other).
  - evidence_span – The exact phrase in the query that supports the correct classification.
  - prompt_fix_suggestion – A clear instruction you could add to the classifier prompt to prevent this kind of error in the future.
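Below is a Python sketch of these two evaluators, under a few assumptions: the judge prompt wording and judge model are illustrative, Phoenix is assumed to bind evaluator arguments by parameter name (output, expected, input), and the sketch returns only a numeric score, whereas the notebook's evaluator also surfaces the explanation, confusion_reason, error_type, evidence_span, and prompt_fix_suggestion fields described above.

```python
import json

from openai import OpenAI

judge_client = OpenAI()

# Exact-match evaluator: does the predicted label match the ground_truth column?
def ground_truth_evaluator(output, expected):
    return 1.0 if str(output).strip().lower() == str(expected["ground_truth"]).strip().lower() else 0.0

# Illustrative judge prompt; the notebook's actual wording will differ.
JUDGE_TEMPLATE = """You are reviewing a support-ticket classification.
Query: {query}
Predicted label: {predicted}
Correct label: {ground_truth}

Respond with a JSON object containing: score (1 if the prediction is correct,
else 0), explanation, confusion_reason, error_type (broad_vs_specific,
keyword_bias, multi_intent_confusion, ambiguous_query, off_topic,
paraphrase_gap, or other), evidence_span, and prompt_fix_suggestion."""

# LLM-as-judge evaluator: scores each prediction and explains the outcome.
def output_evaluator(input, output, expected):
    response = judge_client.chat.completions.create(
        model="gpt-4o-mini",  # example judge model
        messages=[{
            "role": "user",
            "content": JUDGE_TEMPLATE.format(
                query=input["query"],
                predicted=output,
                ground_truth=expected["ground_truth"],
            ),
        }],
        response_format={"type": "json_object"},
    )
    result = json.loads(response.choices[0].message.content)
    # The full JSON (explanation, error_type, etc.) is the valuable feedback;
    # for this sketch we return just the numeric score.
    return float(result.get("score", 0))
```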
Run Experiment
With the dataset, task function, and evaluators defined, we can run the experiment. Phoenix runs the task against every example in the dataset, applies each evaluator to the resulting outputs, and records everything for analysis in the next step.
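A minimal sketch of launching the run, assuming the dataset object from the Step 1 sketch and the task and evaluators defined above; the experiment name is an example.

```python
from phoenix.experiments import run_experiment

experiment = run_experiment(
    dataset,                     # dataset uploaded/fetched in Step 1
    task=classify_query,
    evaluators=[ground_truth_evaluator, output_evaluator],
    experiment_name="support-query-classification-baseline",  # example name
)
```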
Step 3: Analyze Experiment Results
After collecting our outputs and evaluation results, the next step is to interpret them. This analysis helps us see where the prompt performs well, where it fails, and which types of errors occur most often - insights we can use to guide our next round of improvements.
After running the experiment in code, it will show up in the Phoenix UI on the Datasets and Experiments page, under our support query dataset. We see that our ground_truth_evaluator gave us a score of 0.53. This means that only 53% of our LLM classifications correctly matched the ground truth, leaving lots of room for improvement!
But we don’t just have that scalar score - we also have the rich, natural-language feedback generated by our LLM evaluator, which helps us write better prompts based on our own data. You can filter for all rows that had incorrect classifications and review the error types the evaluator assigned.
Scanning those rows, we see that many of our errors are broad_vs_specific error types. This is, by far, the largest share of our errors. Note that without our LLM evaluator, it would have been much harder and more time consuming to figure this out!
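If you prefer to quantify the error breakdown outside the UI, here is a small sketch. It assumes you have exported the experiment's per-row evaluator annotations to a CSV; the file name and column names are illustrative, not the notebook's actual export.

```python
import pandas as pd

# Assumed export of the experiment's outputs and evaluator annotations.
annotations = pd.read_csv("experiment_annotations.csv")

# Keep only the rows the exact-match evaluator scored as incorrect.
incorrect = annotations[annotations["ground_truth_evaluator_score"] == 0]

# Count how often each error_type appears in the LLM evaluator's feedback.
print(incorrect["error_type"].value_counts())
```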
Let’s add a specific instruction to our prompt to address broad_vs_specific errors.

