Overview
The Faithfulness evaluator is a specialized hallucination-detection metric that determines whether an LLM’s response is grounded in and faithful to the provided context. It detects when responses contain information that is not supported by or contradicts the reference context.
When to Use
Use the Faithfulness evaluator when you need to:
- Validate RAG (Retrieval-Augmented Generation) outputs - Ensure answers are based on retrieved documents or search results
- Detect hallucinations in grounded responses - Identify when the LLM makes up information not present in the context
- Evaluate Q&A systems over private data - Verify responses only contain information from your knowledge base
This evaluator is specifically designed for grounded responses where context is provided. It is not designed to validate general world knowledge or facts the LLM learned during training.
Supported Levels
The level of an evaluator determines the scope of the evaluation in OpenTelemetry terms. Some evaluations apply to individual spans, some to full traces or sessions, and some can be applied at multiple levels.
| Level | Supported | Notes |
|---|---|---|
| Span | Yes | Best for LLM spans with RAG context. Apply to spans where input, output, and retrieved context are available. |
Relevant span kinds: LLM spans, particularly those in RAG pipelines where documents are retrieved and used as context.
Required Inputs
The Faithfulness evaluator requires three inputs:
| Field | Type | Description |
|---|---|---|
| input | string | The user’s query or question |
| output | string | The LLM’s response to evaluate |
| context | string | The reference context or retrieved documents |
For best results:
- Use human-readable strings rather than raw JSON for all inputs
- For multi-turn conversations, format the input as a readable conversation (see the formatting sketch after this list):
User: What is the refund policy?
Assistant: You can request a refund within 30 days.
User: How do I request one?
- For multiple retrieved documents, concatenate them with clear separators (see Input Mapping example below):
Our return policy allows returns within 30 days of purchase.
Refunds are processed within 5 business days.
Items must be in original condition with tags attached.
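If your source data is structured (chat messages, lists of documents), a little plain-Python formatting is usually all that is needed. The helpers below are a minimal sketch; the message dicts with role and content keys reflect an assumed data shape, not a Phoenix requirement:
def format_conversation(messages):
    # Assumes messages is a list of {"role": ..., "content": ...} dicts.
    return "\n".join(f"{m['role'].capitalize()}: {m['content']}" for m in messages)

def format_documents(documents):
    # Join retrieved documents with a blank line as a clear separator.
    return "\n\n".join(documents)

eval_input = {
    "input": format_conversation([
        {"role": "user", "content": "What is the refund policy?"},
        {"role": "assistant", "content": "You can request a refund within 30 days."},
        {"role": "user", "content": "How do I request one?"},
    ]),
    "output": "Submit a refund request through the support portal.",
    "context": format_documents([
        "Our return policy allows returns within 30 days of purchase.",
        "Refunds are processed within 5 business days.",
        "Items must be in original condition with tags attached.",
    ]),
}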
Output Interpretation
The evaluator returns a Score object with the following properties:
| Property | Value | Description |
|---|---|---|
| label | "faithful" or "unfaithful" | Classification result |
| score | 1.0 or 0.0 | Numeric score (1.0 = faithful, 0.0 = unfaithful) |
| explanation | string | LLM-generated reasoning for the classification |
| direction | "maximize" | Higher scores are better |
| metadata | object | Additional information such as the model name. When tracing is enabled, includes the trace_id for the evaluation. |
Interpretation:
- Faithful (1.0): The response is fully supported by the context and does not contain made-up information
- Unfaithful (0.0): The response contains information not present in the context or contradicts it
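Since evaluate returns a list of Score objects (see Usage Examples below), these properties can be read directly; the threshold check here is illustrative rather than part of the API:
score = faithfulness_eval.evaluate(eval_input)[0]

print(score.label)        # "faithful" or "unfaithful"
print(score.score)        # 1.0 or 0.0
print(score.explanation)  # LLM-generated reasoning

# Illustrative downstream use: flag unfaithful responses for review.
if score.score == 0.0:
    print("Potential hallucination:", score.explanation)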
Usage Examples
Python:
from phoenix.evals import LLM
from phoenix.evals.metrics import FaithfulnessEvaluator
# Initialize the LLM client
llm = LLM(provider="openai", model="gpt-4o")
# Create the evaluator
faithfulness_eval = FaithfulnessEvaluator(llm=llm)
# Inspect the evaluator's requirements
print(faithfulness_eval.describe())
# Evaluate a single example
eval_input = {
"input": "What is the capital of France?",
"output": "Paris is the capital of France.",
"context": "Paris is the capital and largest city of France."
}
scores = faithfulness_eval.evaluate(eval_input)
print(scores[0])
# Score(name='faithfulness', score=1.0, label='faithful', ...)
TypeScript:
import { createFaithfulnessEvaluator } from "@arizeai/phoenix-evals";
import { openai } from "@ai-sdk/openai";
// Create the evaluator
const faithfulnessEvaluator = createFaithfulnessEvaluator({
model: openai("gpt-4o"),
});
// Evaluate an example
const result = await faithfulnessEvaluator.evaluate({
input: "What is the capital of France?",
output: "Paris is the capital of France.",
context: "Paris is the capital and largest city of France.",
});
console.log(result);
// { score: 1, label: "faithful", explanation: "..." }
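To score a batch of examples, you can loop over your records with the same single-example API shown in the Python example above and aggregate the results. This is a plain-Python sketch with made-up data:
records = [
    {
        "input": "What is the capital of France?",
        "output": "Paris is the capital of France.",
        "context": "Paris is the capital and largest city of France.",
    },
    {
        "input": "How long is the refund window?",
        "output": "Refunds are available within 90 days.",
        "context": "Refunds are available within 30 days of purchase.",
    },
]

results = [faithfulness_eval.evaluate(record)[0] for record in records]
faithful_rate = sum(score.score for score in results) / len(results)
print(f"Faithful: {faithful_rate:.0%} of {len(results)} responses")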
When your data has different field names or requires transformation, use input mapping. This is especially useful when you need to combine multiple documents into a single context string.
Python:
from phoenix.evals import LLM
from phoenix.evals.metrics import FaithfulnessEvaluator
llm = LLM(provider="openai", model="gpt-4o")
faithfulness_eval = FaithfulnessEvaluator(llm=llm)
# Example with nested data and multiple documents
eval_input = {
"input": {"query": "What is the return policy?"},
"output": {"response": "You can return items within 30 days."},
"retrieved": {
"documents": [
"Our return policy allows returns within 30 days.",
"Refunds are processed within 5 business days."
]
}
}
# Use input mapping with a lambda to concatenate documents
input_mapping = {
"input": "input.query",
"output": "output.response",
"context": lambda x: "\n\n".join(x["retrieved"]["documents"])
}
scores = faithfulness_eval.evaluate(eval_input, input_mapping)
TypeScript:
import { bindEvaluator, createFaithfulnessEvaluator } from "@arizeai/phoenix-evals";
import { openai } from "@ai-sdk/openai";
const faithfulnessEvaluator = createFaithfulnessEvaluator({
model: openai("gpt-4o"),
});
// Bind with input mapping for different field names
const boundEvaluator = bindEvaluator(faithfulnessEvaluator, {
inputMapping: {
input: "question",
output: "answer",
context: (data) => data.documents.join("\n\n"),
},
});
const result = await boundEvaluator.evaluate({
question: "What is the return policy?",
answer: "You can return items within 30 days.",
documents: [
"Our return policy allows returns within 30 days.",
"Refunds are processed within 5 business days."
],
});
For more details on input mapping options, see Input Mapping.
Configuration
For LLM client configuration options, see Configuring the LLM.
Viewing and Modifying the Prompt
You can view the latest versions of our prompt templates on GitHub. The built-in templates are designed to work well across a variety of contexts, but we highly recommend adapting the prompt to be more specific to your use case.
Python:
from phoenix.evals.metrics import FaithfulnessEvaluator
from phoenix.evals import LLM, ClassificationEvaluator
llm = LLM(provider="openai", model="gpt-4o")
evaluator = FaithfulnessEvaluator(llm=llm)
# View the prompt template
print(evaluator.prompt_template)
# Create a custom evaluator based on the built-in template
custom_evaluator = ClassificationEvaluator(
name="faithfulness",
prompt_template=evaluator.prompt_template, # Modify as needed
llm=llm,
choices={"faithful": 1.0, "unfaithful": 0.0},
direction="maximize",
)
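For example, you might tighten the instructions for a specific domain. The template below is a simplified, hypothetical illustration (the built-in template is more detailed); it assumes the template's {input}, {output}, and {context} placeholders are filled from the eval input:
custom_template = """
You are checking whether an ANSWER is faithful to the provided CONTEXT for an
insurance-policy assistant. Treat paraphrases of the context as faithful, but
treat any dollar amount, date, or coverage limit not stated in the context as
unfaithful.

CONTEXT: {context}
QUESTION: {input}
ANSWER: {output}
"""

domain_evaluator = ClassificationEvaluator(
    name="faithfulness",
    prompt_template=custom_template,
    llm=llm,
    choices={"faithful": 1.0, "unfaithful": 0.0},
    direction="maximize",
)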
TypeScript:
import { FAITHFULNESS_CLASSIFICATION_EVALUATOR_CONFIG, createFaithfulnessEvaluator } from "@arizeai/phoenix-evals";
import { openai } from "@ai-sdk/openai";
// View the prompt template
console.log(FAITHFULNESS_CLASSIFICATION_EVALUATOR_CONFIG.template);
// Create a custom evaluator with a modified template
const customEvaluator = createFaithfulnessEvaluator({
model: openai("gpt-4o"),
promptTemplate: FAITHFULNESS_CLASSIFICATION_EVALUATOR_CONFIG.template, // Modify as needed
});
Using with Phoenix
Evaluating Traces
Run evaluations on traces collected in Phoenix and log results as annotations:
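A minimal sketch of one way to do this with the arize-phoenix client, reusing faithfulness_eval from the examples above. The span filter, the attributes.* column names, and the presence of retrieved documents on the LLM span all depend on your instrumentation, so treat this as a starting point:
import pandas as pd
import phoenix as px
from phoenix.trace import SpanEvaluations

# Pull LLM spans from Phoenix (column names depend on your instrumentation).
spans_df = px.Client().get_spans_dataframe("span_kind == 'LLM'")

rows = []
for span_id, row in spans_df.iterrows():
    score = faithfulness_eval.evaluate({
        "input": row["attributes.input.value"],
        "output": row["attributes.output.value"],
        # Assumed: retrieved documents were recorded on this span; adapt as needed.
        "context": str(row.get("attributes.retrieval.documents", "")),
    })[0]
    rows.append({"context.span_id": span_id, "score": score.score,
                 "label": score.label, "explanation": score.explanation})

# Log the results back to Phoenix as span evaluations (annotations).
evals_df = pd.DataFrame(rows).set_index("context.span_id")
px.Client().log_evaluations(SpanEvaluations(eval_name="faithfulness", dataframe=evals_df))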
Running Experiments
Use the Faithfulness evaluator in Phoenix experiments:
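A minimal sketch, assuming you already have a Phoenix dataset whose examples carry a question in input and the retrieved context in metadata (those field names and my_rag_app are hypothetical), and wrapping the evaluator in a plain function so the experiment runner can call it:
from phoenix.experiments import run_experiment

def task(example):
    # Your application under test; my_rag_app is a hypothetical RAG pipeline.
    return my_rag_app(example.input["question"])

def faithfulness(input, output, metadata):
    # Function-style experiment evaluator wrapping the FaithfulnessEvaluator.
    score = faithfulness_eval.evaluate({
        "input": input["question"],
        "output": output,
        "context": metadata["context"],
    })[0]
    return score.score

# `dataset` is a Phoenix dataset you created or retrieved earlier.
run_experiment(dataset, task=task, evaluators=[faithfulness])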
API Reference