To truly improve a prompt, you first need visibility - and visibility comes from data. In Part 1, we identified a misclassification in our traces and edited our prompt to fix it. But validating on a single example isn’t enough. A single trace can show you one mistake, but only a dataset can show you the pattern behind many. Part 2 of this walkthrough focuses on using Phoenix to:
  1. Run our current prompt across a dataset of inputs
  2. Compute metrics to measure prompt performance
  3. Generate natural-language feedback to guide improvements
  4. Edit and retest prompts to build confidence in our changes
  5. Save and manage our prompt versions in Prompt Hub
Follow along with code: This guide has a companion notebook with runnable code examples. Find it here, and go to Part 2: Test Prompts at Scale.

Step 1: Load Dataset of Inputs

Let’s upload a dataset of support queries and run our new classification prompt against all of them. This lets us measure performance systematically before deploying to production.
  1. Download support_queries.csv here.
  2. Navigate to Datasets and Experiments, click Create Dataset, and upload.
  3. Select query for Input keys, as this is our input column.
  4. Select ground_truth for Output keys, as this is our ground truth output.
  5. Click Create Dataset and navigate to your new dataset.

    Upload dataset to Phoenix
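If you prefer to create the dataset in code rather than through the UI, a sketch like the one below should work. It assumes the phoenix Client's upload_dataset helper; exact signatures can vary across Phoenix versions, so check the notebook or docs if yours differs. The object it returns is the support_query_dataset we pass to the experiment in Step 2.
import pandas as pd
import phoenix as px

# assumes support_queries.csv has been downloaded to the working directory
df = pd.read_csv("support_queries.csv")

support_query_dataset = px.Client().upload_dataset(
    dataset_name="support_queries",
    dataframe=df,
    input_keys=["query"],          # our input column
    output_keys=["ground_truth"],  # our ground truth column
)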

Step 2: Run Experiment with Our Current Prompt

With a dataset in place, the next step is to measure how our prompt performs across many examples. This gives us a clear baseline for accuracy and helps surface the common failure patterns we’ll address next.

Define Task Function

The task function specifies how to generate output for every input in the dataset. For us, we generate output by asking our LLM to classify a support query.
# Feel free to use any LLM provider. To run experiments asynchronously, use an async client.
import copy

from openai import AsyncOpenAI
from phoenix.client import Client

async_openai_client = AsyncOpenAI()
px_client = Client()

prompt = px_client.prompts.get(prompt_identifier="support-classifier")
model = prompt._model_name
messages = prompt._template["messages"]

# let's edit the user prompt to match our dataset input key, "query"
messages[1]["content"][0]["text"] = "{{query}}"

async def task(input):
    # substitute this example's query into the {{query}} placeholder before calling the model
    filled_messages = copy.deepcopy(messages)
    filled_messages[1]["content"][0]["text"] = filled_messages[1]["content"][0]["text"].replace(
        "{{query}}", input["query"]
    )
    response = await async_openai_client.chat.completions.create(
        model=model,
        messages=filled_messages,
    )
    return response.choices[0].message.content
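Before running the full experiment, it can help to spot-check the task on a single query. The example below uses a query we'll see again in Step 3; in a notebook, top-level await works as written.
# optional spot-check: run the task on one query before running it across the whole dataset
sample_output = await task({"query": "downgraded me mid-billing cycle"})
print(sample_output)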

Define Evaluators

Running the model gives us raw predictions, but that alone doesn’t tell us much. Evaluators help turn those predictions into meaningful feedback by scoring performance and explaining why the model was right or wrong. This gives us a clearer picture of how our prompt is actually performing. In this example, we’ll use two evaluators:
  • ground_truth_evaluator – Verifies whether the model’s predicted classification matches the ground truth.
  • output_evaluator – Uses an LLM to provide a richer, qualitative analysis of each classification, including:
    • explanation – Why the classification is correct or incorrect.
    • confusion_reason – If incorrect, why the model might have made the wrong choice.
    • error_type – If incorrect, what kind of error occurred (broad_vs_specific, keyword_bias, multi_intent_confusion, ambiguous_query, off_topic, paraphrase_gap, or other).
    • evidence_span – The exact phrase in the query that supports the correct classification.
    • prompt_fix_suggestion – A clear instruction you could add to the classifier prompt to prevent this kind of error in the future.
See the full eval prompt we use for output_evaluator in the Define Evaluators section of the notebook. By leveraging the reasoning abilities of LLMs, we can automatically annotate our failure cases with rich diagnostic information, helping us identify weaknesses and iteratively improve our prompt.
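The template isn't reproduced in full here; a minimal sketch consistent with the schema below (the real prompt in the notebook is more detailed) might look like this:
analysis_evaluator_template = """
You are grading a support-query classifier.

Query: {query}
Ground truth label: {ground_truth}
Predicted label: {output}

Return a JSON object with these keys:
- correctness: "correct" or "incorrect"
- explanation: why the prediction is correct or incorrect
- confusion_reason: if incorrect, why the model may have chosen the wrong label
- error_type: broad_vs_specific, keyword_bias, multi_intent_confusion, ambiguous_query, off_topic, paraphrase_gap, or other
- evidence_span: the exact phrase in the query that supports the correct label
- prompt_fix_suggestion: an instruction to add to the classifier prompt to prevent this error
"""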
def normalize(label):
    return label.strip().strip('"').strip("'").lower()

async def ground_truth_evaluator(expected, output):
    return normalize(expected.get("ground_truth")) == normalize(output)
    
from phoenix.evals import create_evaluator
from phoenix.evals.llm import LLM

llm = LLM(provider="openai", model="gpt-4.1")

SCHEMA = {
    "type": "object",
    "properties": {
        "correctness": {"type": "string", "enum": ["correct", "incorrect"]},
        "explanation": {"type": "string"},
        "confusion_reason": {"type": "string"},
        "error_type": {"type": "string"},
        "evidence_span": {"type": "string"},
        "prompt_fix_suggestion": {"type": "string"},
    },
    "required": [
        "correctness",
        "explanation",
        "confusion_reason",
        "error_type",
        "evidence_span",
        "prompt_fix_suggestion",
    ],
    "additionalProperties": False,
}

@create_evaluator(name="output_evaluator", source="llm")
def output_evaluator(query: str, ground_truth: str, output: str):
    template = analysis_evaluator_template

    prompt = (
        template.replace("{query}", query)
            .replace("{ground_truth}", ground_truth)
            .replace("{output}", output)
    )
    obj = llm.generate_object(prompt=prompt, schema=SCHEMA)
    correctness = obj["correctness"]
    score = 1.0 if correctness == "correct" else 0.0
    explanation = (
        f'correctness: {correctness}; '
        f'explanation: {obj.get("explanation","")}; '
        f'confusion_reason: {obj.get("confusion_reason","")}; '
        f'error_type: {obj.get("error_type","")}; '
        f'evidence_span: {obj.get("evidence_span","")}; '
        f'prompt_fix_suggestion: {obj.get("prompt_fix_suggestion","")};'
    )
    return {"score": score, "label": correctness, "explanation": explanation}
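Optionally, you can sanity-check the structured LLM call on a single hand-written example before wiring the evaluator into an experiment. The query and labels below are the same ones we'll revisit in Step 3; the printed fields come from the schema above.
# optional: sanity-check the structured output on one hand-written example
sample = llm.generate_object(
    prompt=analysis_evaluator_template.replace("{query}", "downgraded me mid-billing cycle")
    .replace("{ground_truth}", "Subscription Upgrade/Downgrade")
    .replace("{output}", "Billing Inquiry"),
    schema=SCHEMA,
)
print(sample["correctness"], sample["error_type"])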

Run Experiment

With the task and evaluators defined, we can run the experiment: Phoenix executes the task on every example in the dataset, then applies both evaluators to each output.
from phoenix.client.experiments import async_run_experiment

# support_query_dataset is the dataset we created in Step 1
# (if you uploaded it through the UI, fetch it by name first; see the notebook)
experiment = await async_run_experiment(
    dataset=support_query_dataset,
    task=task,
    evaluators=[ground_truth_evaluator, output_evaluator],
)
Your stdout should look something like this:
running tasks |██████████| 154/154 (100.0%) | ⏳ 00:47<00:00 |  3.21it/s
✅ Task runs completed.
🧠 Evaluation started.
running experiment evaluations |██████████| 308/308 (100.0%) | ⏳ 03:46<00:00 | 1.36it/s
Experiment completed: 154 task runs, 2 evaluator runs, 308 evaluations

Step 3: Analyze Experiment Results

After collecting our outputs and evaluation results, the next step is to interpret them. This analysis shows us where the prompt performs well, where it fails, and which types of errors occur most often - insights we can use to guide our next round of improvements.

Once the experiment finishes, it appears in the Phoenix UI on the Datasets and Experiments page, under our support query dataset. Our ground_truth_evaluator gave us a score of 0.53, meaning that 53% of our LLM classifications matched the ground truth - leaving lots of room for improvement! But we don't just have that scalar score: we also have the rich, natural-language feedback generated by our LLM evaluator, which guides us toward writing better prompts based on our data.

You can filter for all rows with incorrect classifications using the following query:
evals["output_evaluator"].score == 0
Now, hover over the output_evaluator to see the natural language feedback we generated. Here’s one that stood out:
input: "downgraded me mid-billing cycle"
ground_truth: Subscription Upgrade/Downgrade
LLM classification: Billing Inquiry

correctness: incorrect
explanation: The predicted classification 'Billing Inquiry' is incorrect because the user's query is specifically about a change in subscription status ('downgraded') rather than a general billing question. The correct classification is 'Subscription Upgrade/Downgrade' as it directly addresses changes in subscription level during a billing cycle.
confusion_reason: The model likely focused on the word 'billing' and interpreted the issue as a general billing question, missing the more specific context of a subscription change. This is a classic case of confusing a broad category (Billing Inquiry) with a more specific one (Subscription Upgrade/Downgrade).
error_type: broad_vs_specific → The model picked a broader category instead of the more specific correct one.
evidence_span: downgraded me mid-billing cycle
prompt_fix_suggestion: Instruct the classifier to prefer more specific classes (like 'Subscription Upgrade/Downgrade') over broader categories (like 'Billing Inquiry') when a query mentions subscription changes.
It seems we’re hitting the same broad vs specific issue that we corrected for integration help/technical bug report in Part 1. Let’s filter for all rows with the broad_vs_specific error type.
'broad_vs_specific' in evals["output_evaluator"].explanation
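The UI filter is handy for browsing individual failures. If you export the output_evaluator explanations (for example, as a list of strings; the export step isn't shown here), a small helper like this hedged sketch can tally error types in code:
from collections import Counter
import re

def tally_error_types(explanations):
    # pull the "error_type: ..." field out of each explanation string and count occurrences
    counts = Counter()
    for text in explanations:
        match = re.search(r"error_type:\s*([a-z_]+)", text)
        if match:
            counts[match.group(1)] += 1
    return counts

# usage (exported_explanations is a hypothetical list of explanation strings):
# tally_error_types(exported_explanations).most_common()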
We have a lot of broad_vs_specific errors - 30 of them, by far our most common error type. Note that without our LLM evaluator, figuring this out would have been much harder and more time consuming! Let’s add a specific instruction to our prompt to address broad_vs_specific errors:
When classifying user queries, always prefer the most specific applicable category over a broader one. If a query mentions a clear, concrete action or object (e.g., subscription downgrade, invoice, profile name), classify it under that specific intent rather than a general one (e.g., Billing Inquiry, General Feedback).

Summary

Congratulations! You’ve successfully validated your prompt at scale: running real experiments, collecting quantitative and qualitative feedback, and uncovering exactly where and why your model fails. You used an LLM evaluator to analyze your application at scale instead of manually reading every single input/output pair.

Next Steps

In Part 3, we’ll enhance our prompt by adding the new instruction and adjusting key model parameters such as model choice, temperature, and top_p. Then we’ll rerun experiments using the updated prompt and directly compare the results with our previous version. You’ll learn how to use Phoenix to experiment with and evaluate multiple prompt versions side by side, helping you identify which performs best.