(input, output, expected) -> score.
Setup
Phoenix is vendor agnostic and thus doesn’t require you to use any particular evals library. Because of this, the eval libraries for Phoenix are distributed as separate packages. The Phoenix eval libraries are very lightweight and provide many utilities to make evaluation simpler.
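The snippet below sketches a typical Python setup. The package name is an assumption based on how Phoenix distributes its eval utilities, so check the install docs for your version.

```python
# Assumed package names; install the eval utilities alongside Phoenix itself:
#   pip install arize-phoenix arize-phoenix-evals

# The eval utilities then live under the phoenix namespace:
from phoenix.evals import OpenAIModel  # LLM wrapper used by the evaluators below
```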
LLM Evaluators
LLM Evaluators are functions where an LLM as a judge performs the scoring of your AI task. LLM Evaluators are useful when you cannot express the scoring as a simple block of code (e.g. is the answer relevant to the question). With Phoenix you can either:
- Use and extend a pre-built evaluator
- Create a custom evaluator using the evals library
- Create your own LLM evaluator using your own tooling
Pre-built LLM Evaluators
Phoenix provides LLM evaluators out of the box. These evaluators are vendor agnostic and can be instantiated with any LLM provider.
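As a rough sketch, a pre-built evaluator is instantiated by handing it the judge model you want to use. The specific class and import paths below are assumptions and may differ across Phoenix versions.

```python
from phoenix.evals import OpenAIModel
from phoenix.experiments.evaluators import HelpfulnessEvaluator  # one example of a pre-built judge

# The evaluator is vendor agnostic: swap in any supported provider's model wrapper.
judge_model = OpenAIModel(model="gpt-4o-mini")
helpfulness = HelpfulnessEvaluator(model=judge_model)
```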
Custom LLM Evaluators
Phoenix eval libraries provide building blocks for you to build your own LLM-as-a-judge evaluators. You can create custom classification evaluators that use an LLM to classify outputs into categories with optional scores.
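The exact classification building blocks vary between versions of the eval libraries, so the sketch below shows the same idea as a plain function evaluator that calls an LLM directly through the OpenAI SDK. The template, labels, and score mapping are illustrative assumptions.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Illustrative prompt template for a relevance-style judge.
TEMPLATE = (
    "You are judging whether an answer is relevant to a question.\n"
    "Question: {input}\n"
    "Answer: {output}\n"
    "Reply with a single word: relevant or irrelevant."
)

def relevance(input, output):
    prompt = TEMPLATE.format(input=input, output=output)
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    label = response.choices[0].message.content.strip().lower()
    # Map the classification label to an optional numeric score.
    return 1.0 if label == "relevant" else 0.0
```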
Code Evaluators
Code evaluators are functions that evaluate the output of your LLM task without using another LLM as a judge. An example might be checking whether a given output contains a link, which can be implemented as a regex match. The simplest way to create a code evaluator is to write a function. By default, a function of one argument will be passed the output of an experiment run. These evaluators can return either a boolean or a numeric value, which will be recorded as the evaluation score.
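For example, the link check mentioned above can be written as a one-argument function; the contains_link name and the regex are just illustrative.

```python
import re

def contains_link(output):
    # True when the output contains an http(s) URL; the boolean is recorded as the score.
    return bool(re.search(r"https?://\S+", output))
```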
Simple Code Evaluators
Imagine our experiment is testing a task that is intended to output a numeric value from 1-100. We can write a simple evaluator to check if the output is within the allowed range.
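A minimal sketch of that evaluator might look like this:

```python
def in_bounds(output):
    # The boolean result is recorded as the evaluation score for the run.
    return 1 <= output <= 100
```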
By simply passing the `in_bounds` function to `run_experiment`, we will automatically generate an evaluation for each experiment run indicating whether or not the output is in the allowed range.
Code Evaluators with Multiple Parameters
More complex evaluations can use additional information. These values can be accessed by defining a function with specific parameter names which are bound to special values:

| Parameter name | Description | Example |
|---|---|---|
| `input` | experiment run input | `def eval(input): ...` |
| `output` | experiment run output | `def eval(output): ...` |
| `expected` | example output | `def eval(expected): ...` |
| `reference` | alias for `expected` | `def eval(reference): ...` |
| `metadata` | experiment metadata | `def eval(metadata): ...` |
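For example, an evaluator that checks the run's output against the example's expected output only needs to name those two parameters; the function name and matching logic here are illustrative.

```python
def contains_expected(output, expected):
    # `output` and `expected` are bound by parameter name, as described in the table above.
    return expected.lower() in output.lower()
```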
Customizing Code Evaluators with create_evaluator
For better integration with the Experiments UI, use the create_evaluator function (or decorator in Python) to set display properties like the evaluator name and kind.
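A sketch of the decorator form is shown below; the import path is an assumption, and the name and kind values are just examples of the display properties you can set.

```python
from phoenix.experiments.evaluators import create_evaluator

# `name` controls how the evaluator is labeled in the Experiments UI,
# and `kind` distinguishes code evaluators from LLM evaluators.
@create_evaluator(name="shorter than expected", kind="CODE")
def wordiness(expected, output):
    return len(output.split()) < len(expected.split())
```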
Running Evaluators in Experiments
Evaluators are passed as a list to the `evaluators` parameter of `run_experiment`. You can use any combination of LLM evaluators, code evaluators, and simple functions.
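Putting it together, a run might look like the sketch below; the import path, dataset, and task are assumptions standing in for your own setup.

```python
from phoenix.experiments import run_experiment

# `dataset` is a Phoenix dataset and `my_task` is the function under test (not shown here).
experiment = run_experiment(
    dataset,
    my_task,
    evaluators=[in_bounds, contains_link, helpfulness],
)
```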

