LLM Evaluators
We provide LLM evaluators out of the box. These evaluators are vendor agnostic and can be instantiated with a Phoenix model wrapper:
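A minimal sketch of instantiating one of these evaluators; the `HelpfulnessEvaluator` built-in and the `OpenAIModel` wrapper shown here are assumptions, and any supported model wrapper could be swapped in:

```python
from phoenix.evals.models import OpenAIModel
from phoenix.experiments.evaluators import HelpfulnessEvaluator

# Wrap a vendor model with a Phoenix model wrapper, then hand it to the evaluator
helpfulness_evaluator = HelpfulnessEvaluator(model=OpenAIModel())
```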
Code Evaluators
Code evaluators are functions that evaluate the output of your LLM task without using another LLM as a judge. An example might be checking whether a given output contains a link, which can be implemented as a RegEx match. `phoenix.experiments.evaluators` contains some pre-built code evaluators that can be passed to the `evaluators` parameter in experiments.
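For example, a link check might look like the following sketch; the `MatchesRegex` evaluator and its `pattern`/`name` arguments are assumptions about the pre-built evaluators' API:

```python
from phoenix.experiments.evaluators import MatchesRegex

# Flags any output that contains a URL-like substring
contains_link = MatchesRegex(
    pattern=r"[a-z]+://[^\s]+",
    name="contains_link",
)
```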
The `contains_link` evaluator can then be passed as an evaluator to any experiment you'd like to run.
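For instance, assuming a dataset and task have already been defined, a run might look like:

```python
from phoenix.experiments import run_experiment

# Each run's output is scored by contains_link
experiment = run_experiment(dataset, task, evaluators=[contains_link])
```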
For a full list of code evaluators, please consult the repo or the API documentation.
Custom Evaluators
The simplest way to create an evaluator is just to write a Python function. By default, a function of one argument will be passed the output of an experiment run. These custom evaluators can return either a boolean or a numeric value, which will be recorded as the evaluation score.
- Output in bounds
Imagine our experiment is testing a task that is intended to output a numeric value from 1-100. We can write a simple evaluator to check if the output is within the allowed range:
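A minimal sketch of such a check, assuming the task output is already a number:

```python
def in_bounds(output):
    # True when the task output falls within the allowed 1-100 range
    return 1 <= output <= 100
```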
By simply passing the `in_bounds` function to `run_experiment`, we will automatically generate evaluations for each experiment run for whether the output is in the allowed range.
Custom evaluator functions can use any of the following parameter names, which will be bound to the corresponding values from the experiment run:
| Parameter name | Description | Example |
|---|---|---|
| `input` | experiment run input | `def eval(input): ...` |
| `output` | experiment run output | `def eval(output): ...` |
| `expected` | example output | `def eval(expected): ...` |
| `reference` | alias for `expected` | `def eval(reference): ...` |
| `metadata` | experiment metadata | `def eval(metadata): ...` |
- Edit Distance
Below is an example of using the `editdistance` library to calculate how close the output is to the expected value:
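A sketch of such an evaluator; casting both values to strings before comparison is an assumption about the dataset:

```python
import editdistance  # third-party package: pip install editdistance

def edit_distance(output, expected) -> int:
    # Levenshtein distance between the task output and the expected value
    return editdistance.eval(str(output), str(expected))
```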
You can also use the `create_evaluator` decorator to further customize how your evaluations show up in the Experiments UI.
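For example, a sketch of a decorated evaluator; the `name` and `kind` arguments shown here are assumptions about the decorator's signature:

```python
from phoenix.experiments.evaluators import create_evaluator

# name sets the label shown in the Experiments UI;
# kind marks this as a code (non-LLM) evaluator
@create_evaluator(name="shorter?", kind="CODE")
def wordiness_evaluator(expected, output):
    # True when the output uses fewer words than the expected answer
    reference_length = len(expected.split())
    output_length = len(output.split())
    return output_length < reference_length
```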
The `wordiness_evaluator` can be passed directly into `run_experiment`!

