LLM Evaluators
LLM evaluators use a judge model to assess the quality of outputs. They are useful for subjective or nuanced evaluations where simple rules don’t suffice.
Faithfulness
Measures whether a response is faithful to (grounded in) the provided context. Detects hallucinations and unsupported claims.
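To make the judge pattern concrete, here is a minimal sketch of a faithfulness check. The prompt wording, the `gpt-4o-mini` model name, and the faithful/unfaithful label set are illustrative assumptions, not Phoenix's actual template; Correctness and Document Relevance below follow the same judge pattern with different prompts.

```python
# Hypothetical sketch of an LLM-as-judge faithfulness check.
# Prompt, model name, and labels are placeholders, not Phoenix's template.
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are checking whether an answer is grounded in the context.
Context: {context}
Answer: {answer}
Reply with exactly one word: "faithful" if every claim in the answer is
supported by the context, otherwise "unfaithful"."""

def judge_faithfulness(context: str, answer: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder judge model
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(context=context, answer=answer)}],
        temperature=0,  # deterministic judging
    )
    return response.choices[0].message.content.strip().lower()

print(judge_faithfulness("Phoenix is an open-source LLM tracing tool.",
                         "Phoenix was released in 1987 as a database."))
# -> "unfaithful" (the claim is not supported by the context)
```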
Correctness
Evaluates the general correctness of an LLM response.
Document Relevance
Assesses whether retrieved documents are relevant to the input query. Useful for RAG evaluation.
Tool Selection
Determines whether the correct tool was selected for a given context from the available options.
Tool Invocation
Checks if a tool was invoked correctly with proper arguments, formatting, and safe content.
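Tool Selection is typically judged by an LLM given the candidate tool descriptions, while the argument and formatting half of Tool Invocation can be checked deterministically. The sketch below validates a tool call's arguments against a JSON Schema; the schema and call format are assumptions for illustration, not Phoenix's internal representation.

```python
# Hypothetical sketch: validate tool-call arguments against a JSON Schema.
# The weather_schema and raw-argument format are illustrative assumptions.
import json
from jsonschema import ValidationError, validate

weather_schema = {
    "type": "object",
    "properties": {
        "city": {"type": "string"},
        "unit": {"enum": ["celsius", "fahrenheit"]},
    },
    "required": ["city"],
    "additionalProperties": False,
}

def check_invocation(raw_arguments: str) -> bool:
    """Return True if the tool call parses as JSON and matches the schema."""
    try:
        validate(json.loads(raw_arguments), weather_schema)
        return True
    except (json.JSONDecodeError, ValidationError):
        return False

print(check_invocation('{"city": "Paris", "unit": "celsius"}'))  # True
print(check_invocation('{"unit": "kelvin"}'))                    # False
```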
Code Evaluators
Code evaluators use deterministic logic for evaluation. They are faster, cheaper, and provide consistent results for objective criteria.
Exact Match
Checks if the output exactly matches an expected value. Supports optional normalization.
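A minimal sketch of the exact-match logic; the specific normalization steps (case folding, whitespace stripping) are assumptions:

```python
def exact_match(output: str, expected: str, normalize: bool = True) -> bool:
    """Exact-match check; the normalization shown is one plausible option."""
    if normalize:
        output, expected = output.strip().lower(), expected.strip().lower()
    return output == expected

print(exact_match("  Paris ", "paris"))  # True with normalization
```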
Matches Regex
Validates that output matches a specified regular expression pattern.
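A minimal sketch; whether matching is anchored (fullmatch) or substring-based (search) may differ in Phoenix, so treat the search semantics here as an assumption:

```python
import re

def matches_regex(output: str, pattern: str) -> bool:
    """True if the output contains a match for the given pattern."""
    return re.search(pattern, output) is not None

# e.g. verify the output contains an ISO-formatted date
print(matches_regex("Shipped on 2024-05-01.", r"\d{4}-\d{2}-\d{2}"))  # True
```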
Precision / Recall / F-Score
Computes precision, recall, and F1 scores for comparing predicted vs actual values.
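A minimal sketch of the computation, assuming predicted and actual values are compared as sets: precision is true positives over predicted, recall is true positives over actual, and F1 is their harmonic mean.

```python
def precision_recall_f1(predicted: set, actual: set) -> tuple[float, float, float]:
    """Set-based precision/recall/F1; the set comparison is an assumption."""
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

print(precision_recall_f1({"a", "b", "c"}, {"b", "c", "d"}))
# (0.667, 0.667, 0.667): 2 true positives out of 3 predicted and 3 actual
```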
Legacy Evaluators
Legacy evaluators are template-based evaluators from earlier versions of Phoenix. They remain available for backwards compatibility, but we recommend using the modern evaluators above for new projects.
Q&A Evaluation
Evaluates Q&A correctness using legacy templates.
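The legacy evaluators share one pattern: a prompt template plus llm_classify. Here is a sketch for Q&A correctness; the exact keyword names and expected column names can vary across Phoenix versions, so verify against your installed release.

```python
import pandas as pd
from phoenix.evals import (
    QA_PROMPT_RAILS_MAP,
    QA_PROMPT_TEMPLATE,
    OpenAIModel,
    llm_classify,
)

# The template expects input/reference/output columns
# (check the docs for your Phoenix version).
df = pd.DataFrame({
    "input": ["What is the capital of France?"],
    "reference": ["France's capital city is Paris."],
    "output": ["Paris"],
})

results = llm_classify(
    df,
    model=OpenAIModel(model="gpt-4o-mini"),  # kwarg name may differ by version
    template=QA_PROMPT_TEMPLATE,
    rails=list(QA_PROMPT_RAILS_MAP.values()),
    provide_explanation=True,
)
print(results["label"])  # e.g. "correct" / "incorrect"
```

The other legacy evaluators below swap in their own template and rails pair, e.g. RAG_RELEVANCY_PROMPT_TEMPLATE, SUMMARIZATION_PROMPT_TEMPLATE, or TOXICITY_PROMPT_TEMPLATE.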
Retrieval / RAG Relevance
Legacy document relevance evaluation for RAG systems.
Summarization
Evaluates summary quality using legacy templates.
Toxicity
Legacy toxicity detection evaluation.
SQL Generation
Evaluates SQL query correctness using legacy templates.
Tool Calling (Legacy)
Legacy tool calling evaluation. Consider using Tool Invocation and Tool Selection instead.
Looking to create custom evaluators? See the Building Custom Evaluators guide.

