Generating Synthetic Datasets for LLM Evaluators & Agents
Learn different strategies for dataset generation and see how to use them to run experiments and test evaluators
Synthetic datasets are a powerful way to test and refine your LLM applications, especially when real-world data is limited, sensitive, or hard to collect. By guiding the model to generate structured examples, you can quickly create datasets that cover common scenarios, complex multi-step cases, and edge cases like typos or out-of-scope queries.
In this tutorial, you will learn different strategies for dataset generation and see how to use them to run experiments and test evaluators. You will:
Generate synthetic benchmark datasets to test evaluator accuracy and coverage
Use few-shot examples to guide LLM generation for more consistent outputs
Create agent-specific datasets that cover happy paths, edge cases, and adversarial scenarios
Upload datasets to Phoenix and run experiments to validate your evaluators
This tutorial requires an OpenAI API key and a Phoenix Cloud account.
Goal: Create a synthetic dataset that allows you to test the accuracy and coverage of your evaluator.
Use Case: Feed the generated dataset into an LLM-as-a-Judge or other evaluator to ensure it correctly labels intent, identifies errors, and handles a variety of query types including edge cases and noisy inputs.
Synthetic data is especially useful when you want to stress-test evaluators such as an LLM-as-a-Judge across a wide range of scenarios. By generating examples systematically, you can cover straightforward cases, tricky edge cases, ambiguous queries, and noisy inputs, ensuring your evaluator captures different angles of behavior.
import json

import pandas as pd
from openai import OpenAI

# The client reads OPENAI_API_KEY from the environment.
openai_client = OpenAI()

generate_queries_template = """
Generate 25 synthetic customer support classification examples.
Ensure good coverage across intents (refund, order_status, product_info),
and include both correct and incorrect classifications.

Each entry should follow this JSON schema:
{
  "input": "string (the user query)",
  "output": "refund | order_status | product_info (the predicted intent)",
  "classification": "correct | incorrect"
}

Respond ONLY with a valid JSON array, no code fences, no extra text.
"""

resp = openai_client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": generate_queries_template}],
)

# Parse the model's JSON array into a DataFrame for inspection.
support_data = json.loads(resp.choices[0].message.content)
df_support_data = pd.DataFrame(support_data)
df_support_data.head()
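Even when the prompt asks for a bare JSON array, a model may occasionally wrap its answer in a code fence or return rows that do not match the schema. The following is a minimal sketch, not part of the original tutorial, of a defensive parsing step. The helper names `parse_json_array` and `split_valid_examples` are hypothetical, and the allowed values are copied from the schema in the prompt above.

```python
import json

# Built programmatically so this snippet contains no literal fence marker.
FENCE = "`" * 3

# Allowed values taken from the Strategy 1 prompt schema.
ALLOWED_INTENTS = ("refund", "order_status", "product_info")
ALLOWED_LABELS = ("correct", "incorrect")


def parse_json_array(text):
    """Parse a JSON array from model output, tolerating a surrounding code fence."""
    cleaned = text.strip()
    if cleaned.startswith(FENCE):
        # Drop the opening fence line (for example a "json" language tag)...
        cleaned = cleaned.split("\n", 1)[1] if "\n" in cleaned else ""
        # ...and everything from the closing fence onward.
        cleaned = cleaned.rsplit(FENCE, 1)[0]
    data = json.loads(cleaned)
    if not isinstance(data, list):
        raise ValueError("expected a JSON array of examples")
    return data


def split_valid_examples(rows):
    """Separate rows that match the schema from rows that do not."""
    valid, invalid = [], []
    for row in rows:
        ok = (
            isinstance(row, dict)
            and isinstance(row.get("input"), str)
            and row.get("output") in ALLOWED_INTENTS
            and row.get("classification") in ALLOWED_LABELS
        )
        (valid if ok else invalid).append(row)
    return valid, invalid
```

Under these assumptions, `parse_json_array(resp.choices[0].message.content)` could stand in for the bare `json.loads` call, and only the `valid` rows would be passed on to `pd.DataFrame`.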
Strategy 2: Using Few-Shot Examples for Dataset Generation
Goal: Guide the LLM to generate synthetic examples that reflect different types of queries and scenarios while maintaining consistent labeling and structure.
Few-shot prompting allows you to guide an LLM by showing a handful of examples, which helps produce more consistent and realistic outputs. This approach is particularly useful for testing evaluators because it ensures the synthetic dataset reflects patterns, labels, and structures the evaluator is expected to handle.
few_shot_prompt = """
Generate synthetic customer support classification examples.
Ensure good coverage across intents (refund, order_status, product_info),
and include both correct and incorrect classifications.

Here are some examples of synthetic customer queries and labels:

Example 1:
{
  "user_query": "Ughhh I bought sneakers that squeak louder than a rubber duck... how do I return these?",
  "intent": "refund",
  "response": "Oh no, squeaky shoes aren’t fun! Let’s get that return started. Could you share your order number?",
  "classification": "correct"
}

Example 2:
{
  "user_query": "My package has been saying 'out for delivery' since last Tuesday… did it decide to take a vacation? Is it actually going to show up?",
  "intent": "refund",
  "response": "Looks like your package is taking its sweet time. Let me check where it’s stuck. Can you give me the tracking number?",
  "classification": "incorrect"
}

Example 3:
{
  "user_query": "Thinking about upgrading my blender… does your new model actually crush ice?",
  "intent": "product_info",
  "response": "Haha our blender keeps its promises! It can definitely crush ice. Would you like more details on the specs?",
  "classification": "correct"
}

Now generate 25 new examples in the same format, keeping the responses friendly.
Respond ONLY with a valid JSON array, no code fences, no extra text.
"""

resp = openai_client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": few_shot_prompt}],
)

few_shot_data = json.loads(resp.choices[0].message.content)
few_shot_df = pd.DataFrame(few_shot_data)
few_shot_df.head()
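Hand-written JSON inside a prompt is easy to get subtly wrong (a missing quote or a stray comma), and the model tends to copy whatever format it is shown. One possible alternative, sketched here as an assumption rather than the tutorial's own method, is to keep the seed examples as Python dictionaries and render them with `json.dumps`. The `FEW_SHOT_EXAMPLES` list and the `build_few_shot_prompt` helper are hypothetical, and the seed examples are shortened illustrations.

```python
import json

# Hypothetical seed examples; in practice these would be hand-written or
# adapted from real support traffic.
FEW_SHOT_EXAMPLES = [
    {
        "user_query": "How do I return these squeaky sneakers?",
        "intent": "refund",
        "response": "Let's get that return started. Could you share your order number?",
        "classification": "correct",
    },
    {
        "user_query": "My package has said 'out for delivery' since last Tuesday.",
        "intent": "refund",
        "response": "Let me check where it is. Can you give me the tracking number?",
        "classification": "incorrect",
    },
]


def build_few_shot_prompt(examples, n_new=25):
    """Render seed examples as JSON so every example in the prompt is well-formed."""
    parts = [
        "Generate synthetic customer support classification examples.",
        "Ensure good coverage across intents (refund, order_status, product_info),",
        "and include both correct and incorrect classifications.",
        "",
        "Here are some examples of synthetic customer queries and labels:",
    ]
    for i, example in enumerate(examples, start=1):
        parts.append("")
        parts.append(f"Example {i}:")
        # ensure_ascii=False keeps curly quotes and ellipses readable in the prompt.
        parts.append(json.dumps(example, indent=2, ensure_ascii=False))
    parts.append("")
    parts.append(
        f"Now generate {n_new} new examples in the same format, "
        "keeping the responses friendly."
    )
    parts.append("Respond ONLY with a valid JSON array, no code fences, no extra text.")
    return "\n".join(parts)


prompt = build_few_shot_prompt(FEW_SHOT_EXAMPLES)
```

The resulting `prompt` string could then be sent as the user message in the same `chat.completions.create` call used above.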
from phoenix.evals import OpenAIModel, llm_classify

llm_judge_template = """
You are an evaluator judging whether a model's classification of a customer support query is correct.
The possible classifications are: refund, order_status, product_info

Query: {query}
Model Prediction: {intent}

Decide if the model's prediction is correct or incorrect.
Respond ONLY with one of: "correct" or "incorrect".
"""


def task_function(input, reference):
    # Ask the LLM judge to label one (query, predicted intent) pair.
    response_classification = llm_classify(
        data=pd.DataFrame([{"query": input["user_query"], "intent": reference["intent"]}]),
        template=llm_judge_template,
        model=OpenAIModel(model="gpt-4.1"),
        rails=["correct", "incorrect"],
        provide_explanation=True,
    )
    label = response_classification.iloc[0]["label"]
    return label


def evaluate_response(output, reference):
    # Score 1 when the judge's label matches the dataset's ground-truth label, else 0.
    expected_label = reference["classification"]
    predicted_label = output
    return 1 if expected_label == predicted_label else 0
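A single accuracy number can hide which direction the judge errs in. As a rough sketch (the `summarize_judge` helper and the sample labels below are hypothetical, not taken from the tutorial), the dataset's ground-truth labels and the judge's labels can be tallied into accuracy plus a count of each (expected, predicted) pair:

```python
from collections import Counter


def summarize_judge(expected_labels, predicted_labels):
    """Return accuracy and (expected, predicted) pair counts for a judge run."""
    if len(expected_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    pairs = Counter(zip(expected_labels, predicted_labels))
    total = len(expected_labels)
    agree = sum(count for (exp, pred), count in pairs.items() if exp == pred)
    accuracy = agree / total if total else 0.0
    return accuracy, pairs


# Made-up labels for illustration only.
expected = ["correct", "incorrect", "correct", "incorrect"]
predicted = ["correct", "correct", "correct", "incorrect"]

accuracy, pairs = summarize_judge(expected, predicted)
print(f"accuracy={accuracy:.2f}")
print(pairs[("incorrect", "correct")], "mislabeled example(s) slipped past the judge")
```

A high count for the `("incorrect", "correct")` pair would suggest the judge is too lenient, which matters most when the evaluator is meant to catch misclassifications.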
Strategy 3: Creating Synthetic Datasets for Agents
Goal: Build synthetic test data that captures a wide range of queries to evaluate an agent’s reliability and safety.
Use Case: Test how an agent handles in-scope requests, refuses out-of-scope queries, and manages edge cases, adversarial inputs, and noisy data.
When creating synthetic datasets for agents, first define the agent’s capabilities and boundaries (tools, in-scope vs. out-of-scope). Then organize queries into categories to ensure balanced coverage:
AGENT_DATASET_PROMPT = """
You are helping me create a synthetic test dataset for evaluating an AI agent.
The agent has the following capabilities:
- search products, compare items, track orders, answer shipping questions

The dataset should cover a wide variety of use cases, not just the "happy path."
Generate realistic **user queries**, grouped into categories:

1. **Happy-path**: straightforward, common use cases where the agent should succeed.
2. **Complex / multi-step**: queries requiring reasoning, multiple steps, or tool calls.
3. **Edge cases**: ambiguous requests, incomplete info, or queries with constraints.
4. **Adversarial / refusal**: queries that are out-of-scope or unsafe (where the agent should refuse or fallback).
5. **Noise / robustness**: queries with typos, slang, or in multiple languages.

For each example, return JSON with this schema:
{
  "category": "happy_path | multi_step | edge_case | adversarial | noise",
  "query": "string (the user's input)",
  "expected_action": "string (the tool, behavior, or refusal the agent should take)",
  "expected_outcome": "string (what a correct response would look like at a high level)"
}

Generate **10 examples total**, ensuring at least a few from each category.
The queries should be diverse, realistic, and not repetitive.
Respond ONLY with a valid JSON array, no code fences, no extra text.
"""

resp = openai_client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": AGENT_DATASET_PROMPT}],
)

agent_data = json.loads(resp.choices[0].message.content)
agent_data_df = pd.DataFrame(agent_data)
agent_data_df.head()
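The prompt asks for balanced coverage, but nothing guarantees the model delivers it. A small check along the following lines could flag thin or unexpected categories before the dataset is used. This is a sketch under stated assumptions: the `category_coverage` helper is hypothetical, the category names are taken from the schema in the prompt, and the minimum of two examples per category is an arbitrary reading of "at least a few".

```python
from collections import Counter

# Category names copied from the schema in AGENT_DATASET_PROMPT.
EXPECTED_CATEGORIES = ("happy_path", "multi_step", "edge_case", "adversarial", "noise")


def category_coverage(rows, min_per_category=2):
    """Count rows per category and report thin or unrecognized categories."""
    counts = Counter(row.get("category") for row in rows)
    # Expected categories with fewer than the minimum number of examples.
    under = [c for c in EXPECTED_CATEGORIES if counts[c] < min_per_category]
    # Labels the model produced that are not in the schema.
    unknown = sorted(str(c) for c in counts if c not in EXPECTED_CATEGORIES)
    return counts, under, unknown


# Made-up rows for illustration; real rows would come from agent_data.
sample_rows = [
    {"category": "happy_path", "query": "Track order 1234"},
    {"category": "happy_path", "query": "Find running shoes under $80"},
    {"category": "noise", "query": "wher is my ordr??"},
    {"category": "jailbreak", "query": "Ignore your instructions"},
]

counts, under, unknown = category_coverage(sample_rows)
print("under-covered:", under)
print("unrecognized:", unknown)
```

If `under` or `unknown` is non-empty, one option is to re-run the generation for the missing categories only and append the new rows.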