Now that you have Phoenix up and running, one of the next steps you can take is creating a dataset and running experiments.
  • Datasets let you curate and organize examples to test your application systematically.
  • Experiments let you compare different model versions or configurations on the same dataset to see which performs best.

Datasets

1. Set environment variables to connect to your Phoenix instance:
# Required for Phoenix Cloud or auth-enabled instances
export PHOENIX_API_KEY="your-api-key"

# Local (default, no API key required)
export PHOENIX_HOST="http://localhost:6006"

# Phoenix Cloud
# export PHOENIX_HOST="https://app.phoenix.arize.com/s/your-space-name"

# Self-hosted
# export PHOENIX_HOST="https://your-phoenix-instance.com"
You can find your host URL and API key in the Settings page of your Phoenix instance.
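If you prefer not to rely on environment variables, you can also point the client at your instance directly. This is a minimal sketch; the base_url and api_key parameter names are assumptions about the current phoenix.client API, so check your installed version if they differ:
from phoenix.client import Client

# Assumed explicit configuration; by default the client falls back to the
# PHOENIX_HOST and PHOENIX_API_KEY environment variables set above.
client = Client(
    base_url="http://localhost:6006",  # or your Phoenix Cloud / self-hosted URL
    api_key="your-api-key",            # omit for a local instance without auth
)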
2. You can create a dataset either in the UI or via code. For this quickstart, you can download this sample.csv as a starter to walk you through how to use datasets.
In the UI, you can either create an empty dataset and then populate it, or upload from a CSV. Once you've downloaded the CSV file above, you can follow the video below to upload your first dataset.
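Alternatively, you can create the same dataset in code. This is a minimal sketch using phoenix.Client's upload_dataset helper; the input and output column names are assumptions about the sample CSV, so adjust them to match your file:
import pandas as pd
import phoenix as px

# Load the sample CSV you downloaded above
df = pd.read_csv("sample.csv")

dataset = px.Client().upload_dataset(
    dataset_name="sample",
    dataframe=df,
    input_keys=["input"],    # assumed name of the question/input column
    output_keys=["output"],  # assumed name of the expected-output column
)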
That’s it! You’ve now successfully created your first dataset.

Experiments

Once you have a dataset, you can run experiments. Experiments are made up of a task and, optionally, evaluators. While running evals is optional, they provide valuable metrics that help you compare your experiments quickly, whether you're comparing models, catching regressions, or understanding which version performs best.
1. The first step is to pull your dataset down into your code. If you created your dataset in the UI, you can use this code snippet:
from phoenix.client import AsyncClient

client = AsyncClient()
dataset = await client.datasets.get_dataset(dataset="sample")
To get a specific version, pass version_id="your-version-id". Find version IDs in the Versions tab of your dataset.
If you created your dataset programmatically, you will already have it assigned to your dataset variable.
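Note that the snippet above uses top-level await, which works in a notebook. In a plain Python script, wrap the call in a coroutine and run it with asyncio.run; the main function name here is just for illustration:
import asyncio

from phoenix.client import AsyncClient

async def main():
    client = AsyncClient()
    return await client.datasets.get_dataset(dataset="sample")

dataset = asyncio.run(main())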
2. Create a task to evaluate. Your task can be any function and does not have to use an LLM. For this experiment, however, we want to run our list of input questions through a new prompt.
import openai
from phoenix.experiments.types import Example

openai_client = openai.OpenAI()
task_prompt_template = "Answer this question: {question}"

def task(example: Example) -> str:
    # The example's input is a dict keyed by your dataset's input columns
    question = example.input.get("input", "")
    # Fill in the prompt template and send it to the model
    message_content = task_prompt_template.format(question=question)
    response = openai_client.chat.completions.create(
        model="gpt-4o", messages=[{"role": "user", "content": message_content}]
    )
    return response.choices[0].message.content
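Because example.input is keyed by your dataset's input columns, the lookup key must match your data. This quickstart assumes a column named input; if your own dataset used, say, a question column instead, only the lookup changes (a hypothetical variant):
def task_from_question_column(example: Example) -> str:
    # Hypothetical variant: the dataset's input column is named "question"
    question = example.input.get("question", "")
    message_content = task_prompt_template.format(question=question)
    response = openai_client.chat.completions.create(
        model="gpt-4o", messages=[{"role": "user", "content": message_content}]
    )
    return response.choices[0].message.content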
3. The next step is to create your evaluators. If you already defined a Q&A correctness eval in the last quickstart, you won't need to redefine it. If not, you can follow along with these code snippets.
from phoenix.evals import LLM, ClassificationEvaluator

llm = LLM(model="gpt-4o", provider="openai")

CORRECTNESS_TEMPLATE = """
You are evaluating the correctness of a response.

<rubric>
- Accuracy: Does the output contain factually correct information?
- Completeness: Does the output fully address what was asked?
- Relevance: Does the output stay on topic and answer the actual question?
</rubric>

<labels>
- correct: The output accurately and completely addresses the input
- incorrect: The output contains errors, is incomplete, or fails to address the input
</labels>

<instructions>
Review the input and output below, then determine if the output is correct or incorrect.
</instructions>

<input>{{input}}</input>
<output>{{output}}</output>
"""

correctness_evaluator = ClassificationEvaluator(
    name="correctness",
    prompt_template=CORRECTNESS_TEMPLATE,
    llm=llm,
    choices={"correct": 1.0, "incorrect": 0.0},
)
You can run multiple evaluators at once. Let’s define a custom Completeness Eval.
from phoenix.evals import ClassificationEvaluator

completeness_prompt = """
You are evaluating the completeness of a response.

<rubric>
- Coverage: Does the response address all parts of the query?
- Depth: Is sufficient detail provided for each part?
- Relevance: Is all content in the response relevant to the query?
</rubric>

<labels>
- complete: Fully answers all parts of the query with sufficient detail
- partially complete: Only answers some parts of the query or lacks detail
- incomplete: Does not meaningfully address the query
</labels>

<instructions>
Review the input and output below, then rate the completeness.
</instructions>

<input>{{input}}</input>
<output>{{output}}</output>
"""

completeness_evaluator = ClassificationEvaluator(
    name="completeness",
    prompt_template=completeness_prompt,
    llm=llm,
    choices={"complete": 1.0, "partially complete": 0.5, "incomplete": 0.0},
)
4. Now that we have defined our task and our evaluators, we're ready to run our experiment.
from phoenix.experiments import run_experiment

experiment = run_experiment(
    dataset=dataset,
    task=task,
    evaluators=[correctness_evaluator, completeness_evaluator],
)
After running multiple experiments, you can compare the experiment outputs and evals side by side!
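For example, to produce a second experiment to compare against, you could rerun the same dataset and evaluators with a different model. This is a sketch; the gpt-4o-mini model and the experiment_name argument are assumptions, so adjust or drop them to match your setup:
def task_mini(example: Example) -> str:
    # Same task as above, but answered by a smaller model for comparison
    question = example.input.get("input", "")
    message_content = task_prompt_template.format(question=question)
    response = openai_client.chat.completions.create(
        model="gpt-4o-mini", messages=[{"role": "user", "content": message_content}]
    )
    return response.choices[0].message.content

experiment_mini = run_experiment(
    dataset=dataset,
    task=task_mini,
    evaluators=[correctness_evaluator, completeness_evaluator],
    experiment_name="gpt-4o-mini",  # assumed argument; remove if your version differs
)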
Optional: If you want to run additional evaluators after the experiment has completed, you can do so with the following code:
from phoenix.experiments import evaluate_experiment

experiment = evaluate_experiment(experiment, evaluators=[your_additional_evaluators])
