
LLM Evaluators

We provide LLM evaluators out of the box. These evaluators are vendor agnostic and can be instantiated with a Phoenix model wrapper:
from phoenix.experiments.evaluators import HelpfulnessEvaluator
from phoenix.evals.models import OpenAIModel

helpfulness_evaluator = HelpfulnessEvaluator(model=OpenAIModel())

Code Evaluators

Code evaluators are functions that evaluate the output of your LLM task without using another LLM as a judge. An example might be checking whether a given output contains a link, which can be implemented as a regex match. phoenix.experiments.evaluators contains some pre-built code evaluators that can be passed to the evaluators parameter in experiments.
Python
from phoenix.experiments import run_experiment, MatchesRegex

# This defines a code evaluator for links
contains_link = MatchesRegex(
    pattern=r"[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()@:%_\+.~#?&//=]*)",
    name="contains_link"
)
The above contains_link evaluator can then be passed as an evaluator to any experiment you'd like to run. For a full list of code evaluators, please consult the repo or API documentation.
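Before wiring the pattern into an experiment, you can sanity-check it with Python's built-in re module. The sketch below is standalone and assumes only that MatchesRegex applies the same pattern to each run's output:

```python
import re

# The same URL-matching pattern passed to MatchesRegex above
LINK_PATTERN = r"[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()@:%_\+.~#?&//=]*)"


def looks_like_link(output: str) -> bool:
    # True when the output contains anything resembling a URL or domain
    return re.search(LINK_PATTERN, output) is not None


assert looks_like_link("See https://docs.arize.com for details")
assert not looks_like_link("No links here")
```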

Custom Evaluators

The simplest way to create an evaluator is to write a Python function. By default, a function of one argument will be passed the output of an experiment run. These custom evaluators can return either a boolean or a numeric value, which will be recorded as the evaluation score.
  • Output in bounds
Imagine our experiment is testing a task that is intended to output a numeric value from 1-100. We can write a simple evaluator to check if the output is within the allowed range:
def in_bounds(x):
    return 1 <= x <= 100
By simply passing the in_bounds function to run_experiment, we automatically generate an evaluation for each experiment run indicating whether or not the output is in the allowed range.
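Because in_bounds is plain Python, it is easy to unit test on its own before attaching it to an experiment:

```python
def in_bounds(x):
    # Score is True when the task output falls in the allowed 1-100 range
    return 1 <= x <= 100


# Quick sanity checks before using it in run_experiment
assert in_bounds(1) and in_bounds(100)   # boundaries are inclusive
assert not in_bounds(0)
assert not in_bounds(101)
```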
More complex evaluations can use additional information. These values can be accessed by defining a function with specific parameter names which are bound to special values:
Parameter name | Description            | Example
input          | experiment run input   | def eval(input): ...
output         | experiment run output  | def eval(output): ...
expected       | example output         | def eval(expected): ...
reference      | alias for expected     | def eval(reference): ...
metadata       | experiment metadata    | def eval(metadata): ...
These parameters can be used in any combination and any order to write custom complex evaluators!
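For instance, an evaluator can combine output and expected to score exact matches. The sketch below is illustrative; the "low"/"high" metadata keys in the second function are hypothetical, since metadata contents depend on your experiment:

```python
def exact_match(output, expected) -> bool:
    # Phoenix binds `output` to the run's output and `expected`
    # to the example's expected output, matched by parameter name
    return output == expected


def in_range_with_metadata(output, metadata) -> bool:
    # `metadata` could carry per-experiment configuration,
    # e.g. hypothetical "low"/"high" bounds for this task
    low = metadata.get("low", 1)
    high = metadata.get("high", 100)
    return low <= output <= high


assert exact_match("hello", "hello")
assert not exact_match("hello", "goodbye")
assert in_range_with_metadata(50, {"low": 10, "high": 60})
```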
  • Edit Distance
Below is an example of using the editdistance library to calculate how close the output is to the expected value:
pip install editdistance
import json

import editdistance

def edit_distance(output, expected) -> int:
    return editdistance.eval(
        json.dumps(output, sort_keys=True), json.dumps(expected, sort_keys=True)
    )
For even more customization, use the create_evaluator decorator to further customize how your evaluations show up in the Experiments UI.
Python
from phoenix.experiments.evaluators import create_evaluator

# the decorator can be used to set display properties
# `name` corresponds to the metric name shown in the UI
# `kind` indicates if the eval was made with a "CODE" or "LLM" evaluator
@create_evaluator(name="shorter?", kind="CODE")
def wordiness_evaluator(expected, output):
    reference_length = len(expected.split())
    output_length = len(output.split())
    return output_length < reference_length
The decorated wordiness_evaluator can be passed directly into run_experiment!

Multiple Evaluators on Experiment Runs

Phoenix supports running multiple evals on a single experiment, allowing you to comprehensively assess your model’s performance from different angles. When you provide multiple evaluators, Phoenix creates evaluation runs for every combination of experiment runs and evaluators.
from phoenix.experiments import run_experiment
from phoenix.experiments.evaluators import ContainsKeyword, MatchesRegex

experiment = run_experiment(
    dataset,
    task,
    evaluators=[
        ContainsKeyword("hello"),
        MatchesRegex(r"\d+"),
        custom_evaluator_function
    ]
)