- LLM-based: evaluators that use an LLM to perform the judgement.
- Examples: faithfulness, document relevance
- Code: evaluators that use a deterministic process or heuristic calculation.
- Examples: exact match, BLEU, precision
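To make the distinction concrete, here is a minimal sketch of a code-based evaluator: a deterministic exact-match heuristic with no LLM call. This is illustrative only, not the phoenix-evals implementation.

```python
# Illustrative sketch (not the phoenix-evals API): a code-based evaluator
# computes its result deterministically, with no LLM involved.
def exact_match(output: str, expected: str) -> float:
    """Return 1.0 if the output matches the expected answer exactly, else 0.0."""
    return 1.0 if output.strip() == expected.strip() else 0.0

exact_match("Paris", "Paris ")   # whitespace is stripped before comparing
exact_match("Paris", "London")   # no match
```

An LLM-based evaluator such as faithfulness would instead prompt a model to judge the output, so its result is non-deterministic.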
## Scores
- Every score has the following properties:
- name: The human-readable name of the score/evaluator.
- kind: The origin of the evaluation signal (llm, code, or human).
- direction: The optimization direction; whether a higher score is better or worse.
- Scores may also have some of the following properties:
- score: The numeric value of the result.
- label: The categorical outcome (e.g., “good”, “bad”, or other label).
- explanation: A brief rationale or justification for the result.
- metadata: Arbitrary extra context such as model details, intermediate scores, or run info.
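The properties above can be sketched as a small dataclass. The field names mirror the list above, but this is an assumption-laden illustration, not necessarily the actual phoenix-evals `Score` class.

```python
from dataclasses import dataclass, field
from typing import Optional

# Illustrative sketch only: fields mirror the documented Score properties.
@dataclass
class Score:
    name: str                          # human-readable name of the score/evaluator
    kind: str                          # origin of the signal: "llm", "code", or "human"
    direction: str                     # optimization direction, e.g. "maximize"
    score: Optional[float] = None      # numeric value, if applicable
    label: Optional[str] = None        # categorical outcome, e.g. "good"/"bad"
    explanation: Optional[str] = None  # brief rationale for the result
    metadata: dict = field(default_factory=dict)  # extra context (model details, run info)

s = Score(name="exact_match", kind="code", direction="maximize", score=1.0)
```

Note that `score` and `label` are both optional: a heuristic evaluator may emit only a number, while a classifier-style evaluator may emit only a label.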
## Properties of Evaluators
All phoenix-evals `Evaluators` have the following properties:
- Sync and async `evaluate` methods for evaluating a single record or example.
- Single-record evals return a list of `Score` objects. Oftentimes this is a list of length 1 (e.g. `exact_match`), but some evaluators return multiple scores (e.g. precision-recall).
- A discoverable `input_schema` that describes what inputs it requires to run.
- Evaluators accept an arbitrary `eval_input` payload and an optional `input_mapping` that maps/transforms the input to the shape they require.
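The input-mapping idea can be sketched as follows: the evaluator declares the keys it needs, and an optional mapping adapts an arbitrary payload to that shape. The function name and mapping direction here are assumptions for illustration, not the actual phoenix-evals API.

```python
# Hypothetical helper (not the phoenix-evals API): adapt an arbitrary
# eval_input payload to the keys an evaluator's input schema requires.
def apply_input_mapping(eval_input: dict, input_mapping, required_keys: set) -> dict:
    """Pull each required key out of the payload, renaming via the mapping.

    input_mapping maps each required key to the corresponding key in
    eval_input; keys absent from the mapping are looked up as-is.
    """
    mapping = input_mapping or {}
    return {key: eval_input[mapping.get(key, key)] for key in required_keys}

record = {"response": "Paris", "reference": "Paris"}
inputs = apply_input_mapping(
    record,
    {"output": "response", "expected": "reference"},
    {"output", "expected"},
)
# inputs now has the shape the evaluator expects: keys "output" and "expected"
```

Passing `None` for the mapping leaves the payload keys unchanged, which covers the common case where the payload already matches the schema.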

