Phoenix evaluators accept prompt templates in several formats; each format works with all supported models and providers.

Supported Formats

1. String Prompts

Simple string templates with variable placeholders.
evaluator = ClassificationEvaluator(
    name="sentiment",
    llm=llm,
    prompt_template="Classify the sentiment: {text}",
    choices=["positive", "negative", "neutral"]
)

2. Message Lists

Arrays of message objects with role and content fields.
evaluator = ClassificationEvaluator(
    name="helpfulness",
    llm=llm,
    prompt_template=[
        {"role": "system", "content": "Evaluate the answer helpfulness."},
        {"role": "user", "content": "Question: {question}\nAnswer: {answer}"}
    ],
    choices=["helpful", "somewhat_helpful", "not_helpful"]
)
Supported roles:
  • "system" - Instructions for the model.
  • "user" - User messages and input context.
  • "assistant" - Assistant/model responses, for multi-turn conversations or few-shot examples.

3. Structured Content Parts (Python only)

Messages with multiple content parts, useful for separating different pieces of context.
Only text content is supported at this time.
evaluator = ClassificationEvaluator(
    name="relevance",
    llm=llm,
    prompt_template=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Question: {question}"},
                {"type": "text", "text": "Answer: {answer}"}
            ]
        }
    ],
    choices=["relevant", "not_relevant"]
)

Template Variables

All formats support variable substitution. Python supports both f-string ({variable}) and mustache ({{variable}}) syntax, while TypeScript supports mustache syntax only.
# Variables are provided when calling .evaluate()
result = evaluator.evaluate({
    "question": "What is Python?",
    "answer": "A programming language"
})
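To make the two syntaxes concrete, here is a rough sketch of how substitution behaves; this helper is illustrative only, not Phoenix's actual template engine:

```python
import re

def render(template: str, variables: dict) -> str:
    """Substitute {{var}} (mustache) first, then {var} (f-string style)."""
    out = re.sub(r"\{\{(\w+)\}\}", lambda m: str(variables[m.group(1)]), template)
    out = re.sub(r"\{(\w+)\}", lambda m: str(variables[m.group(1)]), out)
    return out

# Both syntaxes produce the same rendered prompt:
print(render("Classify the sentiment: {text}", {"text": "Great!"}))
# Classify the sentiment: Great!
print(render("Classify the sentiment: {{text}}", {"text": "Great!"}))
# Classify the sentiment: Great!
```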

Client-Specific Behavior

All clients accept the same message format as input. Adapters handle client-specific transformations internally as needed:

OpenAI

  • System role is converted to developer role for reasoning models.
  • Otherwise, messages are passed as-is.
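The system-to-developer swap can be sketched in plain Python; the reasoning-model prefixes below are illustrative, not Phoenix's actual detection logic:

```python
REASONING_MODEL_PREFIXES = ("o1", "o3", "o4-mini")  # illustrative, not exhaustive

def to_openai(messages, model):
    """Sketch: swap the system role for developer on reasoning models."""
    if not model.startswith(REASONING_MODEL_PREFIXES):
        return messages
    return [
        {**m, "role": "developer"} if m["role"] == "system" else m
        for m in messages
    ]
```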

Anthropic

  • System messages are extracted and passed via the system parameter.
  • User/assistant messages are sent in the messages array.
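A rough sketch of that extraction (illustrative only, not Phoenix's actual adapter code):

```python
def to_anthropic(messages):
    """Sketch: pull system messages out into a top-level `system` parameter,
    leaving only user/assistant turns in the messages array."""
    system = "\n".join(m["content"] for m in messages if m["role"] == "system")
    rest = [m for m in messages if m["role"] != "system"]
    return {"system": system, "messages": rest}
```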

Google GenAI

  • System messages are extracted and passed via system_instruction in config.
  • The assistant role is converted to the model role.
  • Messages are sent in the contents array.
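Illustratively (a sketch, not the actual adapter; the role/parts shape follows the Google GenAI contents format):

```python
def to_genai(messages):
    """Sketch: system messages become system_instruction; assistant -> model;
    remaining turns go into the contents array."""
    system_instruction = "\n".join(
        m["content"] for m in messages if m["role"] == "system"
    )
    contents = [
        {
            "role": "model" if m["role"] == "assistant" else m["role"],
            "parts": [{"text": m["content"]}],
        }
        for m in messages
        if m["role"] != "system"
    ]
    return {"system_instruction": system_instruction, "contents": contents}
```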

LiteLLM

  • Messages are passed directly to LiteLLM in OpenAI format.
  • LiteLLM handles provider-specific conversions internally.

LangChain

  • OpenAI-format messages are converted to LangChain message objects (HumanMessage, AIMessage, SystemMessage).

Full Example

A complete example showing evaluator setup and usage:
from phoenix.evals import ClassificationEvaluator, LLM

llm = LLM(provider="openai", model="gpt-4o-mini")

evaluator = ClassificationEvaluator(
    name="helpfulness",
    llm=llm,
    prompt_template=[
        {"role": "system", "content": "You evaluate response helpfulness."},
        {"role": "user", "content": "Question: {question}\nAnswer: {answer}"}
    ],
    choices=["helpful", "somewhat_helpful", "not_helpful"]
)

result = evaluator.evaluate({
    "question": "How do I learn Python?",
    "answer": "Start with online tutorials and practice daily."
})

print(result[0].label)  # e.g., "helpful"