Phoenix evaluators accept prompt templates in several formats; each format works with all supported models and providers.

Supported Formats

1. String Prompts

Simple string templates with variable placeholders.
evaluator = ClassificationEvaluator(
    name="sentiment",
    llm=llm,
    prompt_template="Classify the sentiment: {text}",
    choices=["positive", "negative", "neutral"]
)

2. Message Lists

Arrays of message objects with role and content fields.
evaluator = ClassificationEvaluator(
    name="helpfulness",
    llm=llm,
    prompt_template=[
        {"role": "system", "content": "Evaluate the answer helpfulness."},
        {"role": "user", "content": "Question: {question}\nAnswer: {answer}"}
    ],
    choices=["helpful", "somewhat_helpful", "not_helpful"]
)
Supported roles:
  • "system" - Instructions for the model.
  • "user" - User messages and input context.
  • "assistant" - Assistant/model responses, for multi-turn conversations or few-shot examples.

3. Structured Content Parts (Python only)

Messages with multiple content parts, useful for separating different pieces of context.
Only text content is supported at this time.
evaluator = ClassificationEvaluator(
    name="relevance",
    llm=llm,
    prompt_template=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Question: {question}"},
                {"type": "text", "text": "Answer: {answer}"}
            ]
        }
    ],
    choices=["relevant", "not_relevant"]
)

Template Variables

All formats support variable substitution. Python supports both f-string ({variable}) and mustache ({{variable}}) syntax, while TypeScript supports mustache syntax only.
# Variables are provided when calling .evaluate()
result = evaluator.evaluate({
    "question": "What is Python?",
    "answer": "A programming language"
})
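To make the two syntaxes concrete, here is a rough sketch of how substitution behaves; this helper is illustrative only, not Phoenix's actual template engine:

```python
import re

def render(template: str, variables: dict) -> str:
    """Substitute {{var}} (mustache) first, then {var} (f-string style)."""
    out = re.sub(r"\{\{(\w+)\}\}", lambda m: str(variables[m.group(1)]), template)
    out = re.sub(r"\{(\w+)\}", lambda m: str(variables[m.group(1)]), out)
    return out

# Both syntaxes produce the same rendered prompt:
print(render("Classify the sentiment: {text}", {"text": "Great!"}))
# Classify the sentiment: Great!
print(render("Classify the sentiment: {{text}}", {"text": "Great!"}))
# Classify the sentiment: Great!
```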

Client-Specific Behavior

All clients accept the same message format as input. Adapters handle client-specific transformations internally as needed:

OpenAI

  • System role is converted to developer role for reasoning models.
  • Otherwise, messages are passed as-is.
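The system-to-developer swap can be sketched in plain Python; the reasoning-model prefixes below are illustrative, not Phoenix's actual detection logic:

```python
REASONING_MODEL_PREFIXES = ("o1", "o3", "o4-mini")  # illustrative, not exhaustive

def to_openai(messages, model):
    """Sketch: swap the system role for developer on reasoning models."""
    if not model.startswith(REASONING_MODEL_PREFIXES):
        return messages
    return [
        {**m, "role": "developer"} if m["role"] == "system" else m
        for m in messages
    ]
```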

Anthropic

  • System messages are extracted and passed via the system parameter.
  • User/assistant messages are sent in the messages array.
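A rough sketch of that extraction (illustrative only, not Phoenix's actual adapter code):

```python
def to_anthropic(messages):
    """Sketch: pull system messages out into a top-level `system` parameter,
    leaving only user/assistant turns in the messages array."""
    system = "\n".join(m["content"] for m in messages if m["role"] == "system")
    rest = [m for m in messages if m["role"] != "system"]
    return {"system": system, "messages": rest}
```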

Google GenAI

  • System messages are extracted and passed via system_instruction in config.
  • The assistant role is converted to the model role.
  • Messages are sent in the contents array.
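Illustratively (a sketch, not the actual adapter; the role/parts shape follows the Google GenAI contents format):

```python
def to_genai(messages):
    """Sketch: system messages become system_instruction; assistant -> model;
    remaining turns go into the contents array."""
    system_instruction = "\n".join(
        m["content"] for m in messages if m["role"] == "system"
    )
    contents = [
        {
            "role": "model" if m["role"] == "assistant" else m["role"],
            "parts": [{"text": m["content"]}],
        }
        for m in messages
        if m["role"] != "system"
    ]
    return {"system_instruction": system_instruction, "contents": contents}
```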

LiteLLM

  • Messages are passed directly to LiteLLM in OpenAI format.
  • LiteLLM handles provider-specific conversions internally.

LangChain

  • OpenAI-format messages are converted to LangChain message objects (HumanMessage, AIMessage, SystemMessage).

Full Example

A complete example showing evaluator setup and usage:
from phoenix.evals import ClassificationEvaluator, LLM

llm = LLM(provider="openai", model="gpt-4o-mini")

evaluator = ClassificationEvaluator(
    name="helpfulness",
    llm=llm,
    prompt_template=[
        {"role": "system", "content": "You evaluate response helpfulness."},
        {"role": "user", "content": "Question: {question}\nAnswer: {answer}"}
    ],
    choices=["helpful", "somewhat_helpful", "not_helpful"]
)

result = evaluator.evaluate({
    "question": "How do I learn Python?",
    "answer": "Start with online tutorials and practice daily."
})

print(result[0].label)  # e.g., "helpful"