To truly improve a prompt, you first need visibility - and visibility comes from data. In Part 1, we identified a misclassification in our traces and edited our prompt to fix it. But validating on a single example isn’t enough. A single trace can show you one mistake, but only a dataset can show you the pattern behind many. Part 2 of this walkthrough focuses on using Phoenix to:
  1. Run our current prompt across a dataset of inputs
  2. Compute metrics to measure prompt performance
  3. Generate natural-language feedback to guide improvements
  4. Edit and retest prompts to build confidence in our changes
  5. Save and manage our prompt versions in Prompt Hub
Follow along with code: This guide has a companion notebook with runnable code examples. Find it here, and go to Part 2: Test Prompts at Scale.

Step 1: Load Dataset of Inputs

Let’s upload a dataset of support queries and run our new classification prompt against all of them. This lets us measure performance systematically before deploying to production.
  1. Download support_queries.csv here.
  2. Navigate to Datasets and Experiments, click Create Dataset, and upload.
  3. Select query for Input keys, as this is our input column.
  4. Select ground_truth for Output keys, as this is our ground truth output.
  5. Click Create Dataset and navigate to your new dataset.

    Upload dataset to Phoenix
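If you prefer to create the dataset in code rather than through the UI, a sketch like the one below should work. It assumes the phoenix Client's upload_dataset helper; exact signatures can vary across Phoenix versions, so check the notebook or docs if yours differs. The object it returns is the support_query_dataset we pass to the experiment in Step 2.
import pandas as pd
import phoenix as px

# assumes support_queries.csv has been downloaded to the working directory
df = pd.read_csv("support_queries.csv")

support_query_dataset = px.Client().upload_dataset(
    dataset_name="support_queries",
    dataframe=df,
    input_keys=["query"],          # our input column
    output_keys=["ground_truth"],  # our ground truth column
)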

Step 2: Run Experiment with Our Current Prompt

With a dataset in place, the next step is to measure how our prompt performs across many examples. This gives us a clear baseline for accuracy and helps surface the common failure patterns we’ll address next.

Define Task Function

The task function specifies how to generate output for every input in the dataset. For us, we generate output by asking our LLM to classify a support query.
# Feel free to use any LLM provider. To run experiments asynchronously, use an async client.
import copy

from openai import AsyncOpenAI
from phoenix.client import Client

async_openai_client = AsyncOpenAI()
px_client = Client()

prompt = px_client.prompts.get(prompt_identifier="support-classifier")
model = prompt._model_name
messages = prompt._template["messages"]

# let's edit the user prompt to match our dataset input key, "query"
messages[1]["content"][0]["text"] = "{{query}}"

async def task(input):
    # substitute this example's query into the {{query}} placeholder before calling the model
    filled_messages = copy.deepcopy(messages)
    filled_messages[1]["content"][0]["text"] = filled_messages[1]["content"][0]["text"].replace(
        "{{query}}", input["query"]
    )
    response = await async_openai_client.chat.completions.create(
        model=model,
        messages=filled_messages,
    )
    return response.choices[0].message.content
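Before running the full experiment, it can help to spot-check the task on a single query. The example below uses a query we'll see again in Step 3; in a notebook, top-level await works as written.
# optional spot-check: run the task on one query before running it across the whole dataset
sample_output = await task({"query": "downgraded me mid-billing cycle"})
print(sample_output)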

Define Evaluators

Running the model gives us raw predictions, but that alone doesn’t tell us much. Evaluators help turn those predictions into meaningful feedback by scoring performance and explaining why the model was right or wrong. This gives us a clearer picture of how our prompt is actually performing. In this example, we’ll use two evaluators:
  • ground_truth_evaluator – Verifies whether the model’s predicted classification matches the ground truth.
  • output_evaluator – Uses an LLM to provide a richer, qualitative analysis of each classification, including:
    • explanation – Why the classification is correct or incorrect.
    • confusion_reason – If incorrect, why the model might have made the wrong choice.
    • error_type – If incorrect, what kind of error occurred (broad_vs_specific, keyword_bias, multi_intent_confusion, ambiguous_query, off_topic, paraphrase_gap, or other).
    • evidence_span – The exact phrase in the query that supports the correct classification.
    • prompt_fix_suggestion – A clear instruction you could add to the classifier prompt to prevent this kind of error in the future.
See the full eval prompt we use for output_evaluator in the Define Evaluators section of the notebook. By leveraging the reasoning abilities of LLMs, we can automatically annotate our failure cases with rich diagnostic information, helping us identify weaknesses and iteratively improve our prompt.
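The template isn't reproduced in full here; a minimal sketch consistent with the schema below (the real prompt in the notebook is more detailed) might look like this:
analysis_evaluator_template = """
You are grading a support-query classifier.

Query: {query}
Ground truth label: {ground_truth}
Predicted label: {output}

Return a JSON object with these keys:
- correctness: "correct" or "incorrect"
- explanation: why the prediction is correct or incorrect
- confusion_reason: if incorrect, why the model may have chosen the wrong label
- error_type: broad_vs_specific, keyword_bias, multi_intent_confusion, ambiguous_query, off_topic, paraphrase_gap, or other
- evidence_span: the exact phrase in the query that supports the correct label
- prompt_fix_suggestion: an instruction to add to the classifier prompt to prevent this error
"""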
def normalize(label):
    return label.strip().strip('"').strip("'").lower()

async def ground_truth_evaluator(expected, output):
    return normalize(expected.get("ground_truth")) == normalize(output)
    
from phoenix.evals import create_evaluator
from phoenix.evals.llm import LLM

llm = LLM(provider="openai", model="gpt-4.1")

SCHEMA = {
    "type": "object",
    "properties": {
        "correctness": {"type": "string", "enum": ["correct", "incorrect"]},
        "explanation": {"type": "string"},
        "confusion_reason": {"type": "string"},
        "error_type": {"type": "string"},
        "evidence_span": {"type": "string"},
        "prompt_fix_suggestion": {"type": "string"},
    },
    "required": [
        "correctness",
        "explanation",
        "confusion_reason",
        "error_type",
        "evidence_span",
        "prompt_fix_suggestion",
    ],
    "additionalProperties": False,
}

@create_evaluator(name="output_evaluator", source="llm")
def output_evaluator(query: str, ground_truth: str, output: str):
    template = analysis_evaluator_template

    prompt = (
        template.replace("{query}", query)
            .replace("{ground_truth}", ground_truth)
            .replace("{output}", output)
    )
    obj = llm.generate_object(prompt=prompt, schema=SCHEMA)
    correctness = obj["correctness"]
    score = 1.0 if correctness == "correct" else 0.0
    explanation = (
        f'correctness: {correctness}; '
        f'explanation: {obj.get("explanation","")}; '
        f'confusion_reason: {obj.get("confusion_reason","")}; '
        f'error_type: {obj.get("error_type","")}; '
        f'evidence_span: {obj.get("evidence_span","")}; '
        f'prompt_fix_suggestion: {obj.get("prompt_fix_suggestion","")};'
    )
    return {"score": score, "label": correctness, "explanation": explanation}
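Optionally, you can sanity-check the structured LLM call on a single hand-written example before wiring the evaluator into an experiment. The query and labels below are the same ones we'll revisit in Step 3; the printed fields come from the schema above.
# optional: sanity-check the structured output on one hand-written example
sample = llm.generate_object(
    prompt=analysis_evaluator_template.replace("{query}", "downgraded me mid-billing cycle")
    .replace("{ground_truth}", "Subscription Upgrade/Downgrade")
    .replace("{output}", "Billing Inquiry"),
    schema=SCHEMA,
)
print(sample["correctness"], sample["error_type"])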

Run Experiment

With the task and evaluators defined, we can run the experiment: Phoenix executes the task on every example in the dataset, then applies both evaluators to each output.
from phoenix.client.experiments import async_run_experiment

# support_query_dataset is the dataset we created in Step 1
# (if you uploaded it through the UI, fetch it by name first; see the notebook)
experiment = await async_run_experiment(
    dataset=support_query_dataset,
    task=task,
    evaluators=[ground_truth_evaluator, output_evaluator],
)
Your stdout should look something like this:
running tasks |██████████| 154/154 (100.0%) | ⏳ 00:47<00:00 |  3.21it/s
✅ Task runs completed.
🧠 Evaluation started.
running experiment evaluations |██████████| 308/308 (100.0%) | ⏳ 03:46<00:00 | 1.36it/s
Experiment completed: 154 task runs, 2 evaluator runs, 308 evaluations

Step 3: Analyze Experiment Results

After collecting our outputs and evaluation results, the next step is to interpret them. This analysis shows us where the prompt performs well, where it fails, and which types of errors occur most often - insights we can use to guide our next round of improvements.

Once the experiment finishes, it appears in the Phoenix UI on the Datasets and Experiments page, under our support query dataset. Our ground_truth_evaluator gave us a score of 0.53, meaning that 53% of our LLM classifications matched the ground truth - leaving lots of room for improvement! But we don't just have that scalar score: we also have the rich, natural-language feedback generated by our LLM evaluator, which guides us toward writing better prompts based on our data.

You can filter for all rows with incorrect classifications using the following query:
evals["output_evaluator"].score == 0
Now, hover over the output_evaluator to see the natural language feedback we generated. Here’s one that stood out:
input: "downgraded me mid-billing cycle"
ground_truth: Subscription Upgrade/Downgrade
LLM classification: Billing Inquiry

correctness: incorrect
explanation: The predicted classification 'Billing Inquiry' is incorrect because the user's query is specifically about a change in subscription status ('downgraded') rather than a general billing question. The correct classification is 'Subscription Upgrade/Downgrade' as it directly addresses changes in subscription level during a billing cycle.
confusion_reason: The model likely focused on the word 'billing' and interpreted the issue as a general billing question, missing the more specific context of a subscription change. This is a classic case of confusing a broad category (Billing Inquiry) with a more specific one (Subscription Upgrade/Downgrade).
error_type: broad_vs_specific → The model picked a broader category instead of the more specific correct one.
evidence_span: downgraded me mid-billing cycle
prompt_fix_suggestion: Instruct the classifier to prefer more specific classes (like 'Subscription Upgrade/Downgrade') over broader categories (like 'Billing Inquiry') when a query mentions subscription changes.
It seems we’re hitting the same broad vs specific issue that we corrected for integration help/technical bug report in Part 1. Let’s filter for all rows with the broad_vs_specific error type.
'broad_vs_specific' in evals["output_evaluator"].explanation
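The UI filter is handy for browsing individual failures. If you export the output_evaluator explanations (for example, as a list of strings; the export step isn't shown here), a small helper like this hedged sketch can tally error types in code:
from collections import Counter
import re

def tally_error_types(explanations):
    # pull the "error_type: ..." field out of each explanation string and count occurrences
    counts = Counter()
    for text in explanations:
        match = re.search(r"error_type:\s*([a-z_]+)", text)
        if match:
            counts[match.group(1)] += 1
    return counts

# usage (exported_explanations is a hypothetical list of explanation strings):
# tally_error_types(exported_explanations).most_common()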
We have a lot of broad_vs_specific errors - 30 of them, by far our most common error type. Note that without our LLM evaluator, figuring this out would have been much harder and more time consuming! Let’s add a specific instruction to our prompt to address broad_vs_specific errors:
When classifying user queries, always prefer the most specific applicable category over a broader one. If a query mentions a clear, concrete action or object (e.g., subscription downgrade, invoice, profile name), classify it under that specific intent rather than a general one (e.g., Billing Inquiry, General Feedback).

Summary

Congratulations! You’ve successfully validated your prompt at scale: running real experiments, collecting quantitative and qualitative feedback, and uncovering exactly where and why your model fails. You used an LLM evaluator to analyze your application at scale instead of manually reading every single input/output pair.

Next Steps

In Part 3, we’ll enhance our prompt by adding the new instruction and adjusting key model parameters such as model choice, temperature, and top_p. Then we’ll rerun experiments using the updated prompt and directly compare the results with our previous version. You’ll learn how to use Phoenix to experiment with and evaluate multiple prompt versions side by side, helping you identify which performs best.