Building Custom Evaluators
While pre-built evals offer convenience, the best evals are ones you custom-build for your specific use case. In this guide, we show how to build two types of custom "LLM-as-a-judge" style evaluators:
- A custom ClassificationEvaluator that returns categorical labels.
- A custom LLMEvaluator that scores data on a numeric scale.
Classification Evals
The ClassificationEvaluator is a special LLM-based evaluator designed for classification (both binary and multi-class). It leverages LLM structured-output or tool-calling functionality to ensure consistent and parseable output; this evaluator will only respond with one of the provided label choices and, optionally, an explanation for the judgement.
A classification prompt template contains instructions for the evaluation as well as placeholders for the evaluation input data.
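For example, a relevance-classification template might look like the sketch below. The template text and placeholder names ({input}, {output}) are illustrative assumptions, not the library's built-in wording:

```python
# Illustrative prompt template (an assumption, not the library's actual text).
# {input} and {output} are placeholders filled from each evaluated example.
relevance_template = """You are evaluating whether a reference text is relevant to a question.

Question: {input}
Reference: {output}

Respond with exactly one of the labels: "relevant" or "irrelevant"."""
```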
Label Choices
While the prompt template contains instructions for the LLM, the label choices tell it how to format its response. The choices of a ClassificationEvaluator can be structured in a couple of ways:
- A list of string labels only: choices=["relevant", "irrelevant"]. In this case, Score objects will have a label but no numeric score component.
- String labels mapped to numeric scores: choices = {"irrelevant": 0, "relevant": 1}
The ClassificationEvaluator also supports multi-class labels and scores, for example: choices = {"good": 1.0, "bad": 0.0, "neutral": 0.5}
There is no limit to the number of label choices you can provide, and you can specify any numeric scores (not limited to values between 0 and 1). For example, you can set choices = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5} for a numeric rating task.
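With a label-to-score mapping like the one above, the judge's categorical answer resolves directly to a numeric score via a dictionary lookup:

```python
# Numeric rating choices: each label maps to an arbitrary numeric score.
choices = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5}

# The judge responds with one of the labels; the score is a lookup away.
label = "four"          # e.g., the LLM's constrained response
score = choices[label]  # numeric score associated with that label
```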
Putting it together
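A runnable sketch of what this might look like, with the ClassificationEvaluator stubbed out and a fake LLM standing in for a real model. All class and parameter names here are assumptions based on the description above, not the library's actual API:

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional

@dataclass
class Score:
    """Minimal stand-in for the library's Score object."""
    name: str
    label: str
    score: Optional[float] = None

class ClassificationEvaluator:
    """Simplified stand-in for the library class: formats the prompt,
    asks the LLM for one of the allowed labels, and maps it to a score."""
    def __init__(self, name: str, llm: Callable, prompt_template: str,
                 choices: Dict[str, float]):
        self.name = name
        self.llm = llm
        self.prompt_template = prompt_template
        self.choices = choices  # label -> numeric score

    def evaluate(self, eval_input: dict) -> Score:
        prompt = self.prompt_template.format(**eval_input)
        # The real evaluator constrains output via structured output /
        # tool calling; here the stub LLM simply returns a valid label.
        label = self.llm(prompt, list(self.choices))
        return Score(name=self.name, label=label, score=self.choices[label])

def fake_llm(prompt: str, labels: list) -> str:
    """Stub judge for demonstration: always answers 'relevant'."""
    return "relevant"

evaluator = ClassificationEvaluator(
    name="relevance",
    llm=fake_llm,
    prompt_template="Question: {input}\nReference: {output}\nIs the reference relevant?",
    choices={"irrelevant": 0, "relevant": 1},
)
result = evaluator.evaluate(
    {"input": "What is Python?", "output": "Python is a programming language."}
)
```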
For the relevance evaluation, the evaluator combines the prompt template and label choices defined above.

Custom Numeric Rating LLM Evaluator
The ClassificationEvaluator is a flexible LLM-as-a-judge construct that can also be used to produce numeric ratings (also known as Likert scores).
Note: We generally recommend using categorical labels over numeric ratings for most evaluation tasks. LLMs have inherent limitations in their numeric reasoning abilities, and numeric scores do not correlate as well with human judgements. See this technical report for more information about our findings on this subject.
We then define a ClassificationEvaluator for our evaluation task, similar to how we did above. Make sure to set the optimization direction = "minimize" here, since a lower score is better on this task (fewer spelling errors).
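The direction setting only changes how scores are interpreted, not how they are produced. A small sketch of that semantics (the label set and helper function here are made up for illustration):

```python
# Hypothetical choices for a spelling-error rating task: labels map to
# error counts, so lower is better.
spelling_choices = {"none": 0, "few": 1, "several": 2, "many": 3}
direction = "minimize"  # lower score = fewer spelling errors = better

def better_score(a: float, b: float, direction: str) -> float:
    """Return the preferred of two scores under the optimization direction."""
    return min(a, b) if direction == "minimize" else max(a, b)
```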
Alternative: Fully Custom LLM Evaluator
Alternatively, for LLM-as-a-judge tasks that don’t fit the classification paradigm, you can create a custom evaluator that implements the base LLMEvaluator class. A custom LLMEvaluator can handle almost any complex eval that doesn’t fit into the classification type.
In this example, we implement the same spelling evaluator from above as a fully custom LLMEvaluator.
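The sketch below mimics that pattern with a stubbed base class and a fake LLM so it runs without the real library. The names (LLMEvaluator, _evaluate, eval_input, Score) follow the steps described next, but every signature here is an assumption:

```python
import json
from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass
class Score:
    """Minimal stand-in for the library's Score object."""
    name: str
    score: Optional[float] = None
    explanation: Optional[str] = None

class LLMEvaluator:
    """Stub of the base class: stores configuration and dispatches
    evaluation to the subclass's _evaluate method."""
    def __init__(self, name: str, llm: Callable, prompt_template: str,
                 direction: str = "maximize"):
        self.name = name
        self.llm = llm
        self.prompt_template = prompt_template
        self.direction = direction

    def evaluate(self, eval_input: dict) -> List[Score]:
        # The real base class would also apply input_mapping here.
        return self._evaluate(eval_input)

class SpellingEvaluator(LLMEvaluator):
    """Counts spelling errors; lower is better, so direction='minimize'."""
    PROMPT = (
        "Count the spelling errors in the following text:\n{text}\n"
        'Respond as JSON: {{"errors": <int>, "explanation": <str>}}'
    )

    def __init__(self, llm: Callable):
        super().__init__(name="spelling", llm=llm,
                         prompt_template=self.PROMPT, direction="minimize")

    def _evaluate(self, eval_input: dict) -> List[Score]:
        response = self.llm(self.prompt_template.format(**eval_input))
        parsed = json.loads(response)  # structured output per the JSON schema
        return [Score(name=self.name, score=float(parsed["errors"]),
                      explanation=parsed["explanation"])]

def fake_llm(prompt: str) -> str:
    """Fake LLM returning canned structured output, for demonstration only."""
    return '{"errors": 2, "explanation": "Found two misspelled words."}'

scores = SpellingEvaluator(fake_llm).evaluate({"text": "I teh recieve"})
```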
Steps to create a custom evaluator:
1. Create a new class that inherits from the base LLMEvaluator.
2. Define your prompt template and a JSON schema for the structured output.
3. Initialize the base class with a name, LLM, prompt template, and direction.
4. Implement the _evaluate method, which takes an eval_input and returns a list of Score objects. The base class handles the input_mapping logic, so you can assume the input here has the required input fields.

Improving your Custom Evals
As with all evals, it is important to test that your custom evaluators are working as expected before trusting them at scale. When testing an eval, you can use many of the same techniques used for testing your application:
- Start with a labeled ground-truth dataset. Each input is an example, and each labeled output is the correct judge label.
- Run your eval on that labeled set, and compare its predictions to the ground truth to calculate precision, recall, and F1 scores.
- Tweak your prompt and retest.
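A minimal sketch of that comparison for a binary judge (the labels and data here are made up for illustration):

```python
# Ground-truth labels vs. the judge's predictions on the same examples.
truth = ["relevant", "relevant", "irrelevant", "relevant", "irrelevant"]
preds = ["relevant", "irrelevant", "irrelevant", "relevant", "relevant"]

def prf1(truth, preds, positive="relevant"):
    """Precision, recall, and F1 treating `positive` as the positive class."""
    tp = sum(t == positive and p == positive for t, p in zip(truth, preds))
    fp = sum(t != positive and p == positive for t, p in zip(truth, preds))
    fn = sum(t == positive and p != positive for t, p in zip(truth, preds))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

precision, recall, f1 = prf1(truth, preds)
```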

