Now that you have Phoenix up and running and have sent traces to your first project, the next step is to run evaluations of your application. Evaluations let you measure and monitor the quality of your application by scoring traces against metrics like accuracy, relevance, or custom checks.
1
Set environment variables to connect to your Phoenix instance:
export PHOENIX_API_KEY="your-api-key"

# Local (default, no API key required)
export PHOENIX_HOST="http://localhost:6006"

# Phoenix Cloud
# export PHOENIX_HOST="https://app.phoenix.arize.com/s/your-space-name"

# Self-hosted
# export PHOENIX_HOST="https://your-phoenix-instance.com"
You can find your host URL and API key in the Settings page of your Phoenix instance.
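If you prefer to configure the connection from Python rather than the shell, a minimal sketch (using the same variable names as the exports above) is to set them with os.environ before creating the client:
import os

# Mirror the shell exports above; swap the host for Phoenix Cloud or a
# self-hosted instance as needed.
os.environ["PHOENIX_API_KEY"] = "your-api-key"
os.environ["PHOENIX_HOST"] = "http://localhost:6006"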
2
You’ll need to install the evals library that is part of Phoenix, along with the Phoenix client.
pip install -q "arize-phoenix-evals>=2"
pip install -q "arize-phoenix-client"
3
Since we are running our evaluations on the trace data from our first project, we’ll need to pull that data into our code.
from phoenix.client import Client

px_client = Client()
primary_df = px_client.spans.get_spans_dataframe(project_identifier="my-llm-app")
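Before evaluating, it can help to sanity-check the pulled spans. The sketch below assumes the message columns referenced by the prompt template in the next step (attributes.llm.input_messages and attributes.llm.output_messages) are present; adjust the prefix if your instrumentation names them differently.
# Quick look at the pulled spans
print(primary_df.shape)

# Confirm the LLM message columns used by the correctness template exist
print([c for c in primary_df.columns if c.startswith("attributes.llm.")])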
4
In this example, we will define, create, and run our own evaluator. There are a number of different evaluators you can run, but this quickstart walks through an LLM-as-a-Judge evaluator.
1) Define your LLM Judge Model
We’ll use OpenAI as our evaluation model for this example, but Phoenix supports virtually any model.
Make sure your OPENAI_API_KEY environment variable is set, then create the LLM judge:
from phoenix.evals import LLM

llm = LLM(model="gpt-4o", provider="openai")
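Because the judge calls the OpenAI API under the hood, a quick check that the key is present can save a confusing failure later (this assumes you are using the OpenAI provider, as in this guide):
import os

# Fail fast if the OpenAI API key is missing from the environment
assert os.environ.get("OPENAI_API_KEY"), "Set OPENAI_API_KEY before running evaluations"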
2) Define your Evaluators
We will set up a Q&A correctness evaluator with the LLM of our choice, starting with the LLM-as-a-Judge prompt template. Most LLM-as-a-Judge evaluations can be framed as a classification task where the output is one of two or more categorical labels. The template variables ({attributes.llm.input_messages} and {attributes.llm.output_messages}) correspond to columns in the spans dataframe pulled above.
CORRECTNESS_TEMPLATE = """ 
You are given a question and an answer. Decide if the answer is fully correct. 
Rules: The answer must be factually accurate, complete, and directly address the question. 
If it is, respond with "correct". Otherwise respond with "incorrect". 
[BEGIN DATA]
    ************
    [Question]: {attributes.llm.input_messages}
    ************
    [Answer]: {attributes.llm.output_messages}
[END DATA]

Your response must be a single word, either "correct" or "incorrect",
and should not contain any text or characters aside from that word.
"correct" means that the question is correctly and fully answered by the answer.
"incorrect" means that the question is not correctly or only partially answered by the
answer.
"""
3) Create your Classification Evaluator
from phoenix.evals import create_classifier

correctness_evaluator = create_classifier(
    name="correctness",
    prompt_template=CORRECTNESS_TEMPLATE,
    llm=llm,
    choices={"correct": 1.0, "incorrect": 0.0},
)
5
Now that we have defined our evaluator, we’re ready to evaluate our traces.
from phoenix.evals import evaluate_dataframe

results_df = evaluate_dataframe(
    dataframe=primary_df,
    evaluators=[correctness_evaluator]
)
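The returned dataframe keeps the original span columns and appends the evaluator’s results as new columns (the exact column names depend on your evals version, so treat the inspection below as a sketch and verify against your own output):
# See which columns the evaluator added before logging them as annotations
new_cols = [c for c in results_df.columns if c not in primary_df.columns]
print(new_cols)
print(results_df[new_cols].head())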
6
You can now log your evaluations back to Phoenix so they appear alongside your traces in the project view.
from phoenix.evals.utils import to_annotation_dataframe

evaluations = to_annotation_dataframe(dataframe=results_df)

# Log the annotations with the same client used to pull the spans
px_client.spans.log_span_annotations_dataframe(dataframe=evaluations)

Next Steps