Create custom evaluators for tasks like hallucination detection, relevance scoring, or any binary/multi-class classification:
```typescript
import { createClassifier } from "@arizeai/phoenix-evals/llm";
import { openai } from "@ai-sdk/openai";

const model = openai("gpt-4o-mini");

const promptTemplate = `In this task, you will be presented with a query, a reference text and an answer. The answer is
generated to the question based on the reference text. The answer may contain false information. You
must use the reference text to determine if the answer to the question contains false information,
if the answer is a hallucination of facts. Your objective is to determine whether the answer text
contains factual information and is not a hallucination. A 'hallucination' refers to
an answer that is not based on the reference text or assumes information that is not available in
the reference text. Your response should be a single word: either "factual" or "hallucinated", and
it should not include any other text or characters.

[BEGIN DATA]
************
[Query]: {{input}}
************
[Reference text]: {{reference}}
************
[Answer]: {{output}}
************
[END DATA]

Is the answer above factual or hallucinated based on the query and reference text?`;

// Create the classifier
const evaluator = await createClassifier({
  model,
  choices: { factual: 1, hallucinated: 0 },
  promptTemplate,
});

// Use the classifier
const result = await evaluator({
  output: "Arize is not open source.",
  input: "Is Arize Phoenix Open Source?",
  reference:
    "Arize Phoenix is a platform for building and deploying AI applications. It is open source.",
});

console.log(result);
// Output: { label: "hallucinated", score: 0 }
```
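The `{{input}}`, `{{reference}}`, and `{{output}}` placeholders in the template are filled from the keys of the object passed to the evaluator. As a rough illustration of how this mustache-style substitution works (a minimal sketch for intuition only, not the library's actual templating code):

```typescript
// Illustrative sketch of {{placeholder}} substitution; the library handles
// templating internally, so this helper is hypothetical.
function fillTemplate(
  template: string,
  values: Record<string, string>
): string {
  // Replace each {{key}} with its value; leave unknown keys untouched.
  return template.replace(/\{\{(\w+)\}\}/g, (match, key) =>
    key in values ? values[key] : match
  );
}

const filled = fillTemplate("[Query]: {{input}}\n[Answer]: {{output}}", {
  input: "Is Arize Phoenix Open Source?",
  output: "Arize is not open source.",
});
console.log(filled);
// [Query]: Is Arize Phoenix Open Source?
// [Answer]: Arize is not open source.
```

This is why the keys of the object you pass to `evaluator(...)` must match the placeholder names used in your template.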
The library includes several pre-built evaluators for common evaluation tasks. These evaluators come with optimized prompts and can be used directly with any AI SDK model.
```typescript
import { createHallucinationEvaluator } from "@arizeai/phoenix-evals/llm";
import { openai } from "@ai-sdk/openai";
import { anthropic } from "@ai-sdk/anthropic";

const model = openai("gpt-4o-mini");
// or use any other AI SDK provider
// const model = anthropic("claude-3-haiku-20240307");

// Hallucination Detection
const hallucinationEvaluator = createHallucinationEvaluator({
  model,
});

// Use the evaluator
const result = await hallucinationEvaluator({
  input: "What is the capital of France?",
  context: "France is a country in Europe. Paris is its capital city.",
  output: "The capital of France is London.",
});

console.log(result);
// Output: { label: "hallucinated", score: 0, explanation: "..." }
```
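Because each result carries a numeric score (1 for factual, 0 for hallucinated) alongside its label, results are easy to aggregate across a batch of examples. A small sketch, where the `EvalResult` interface and `hallucinationRate` helper are illustrative and not exports of the library:

```typescript
// Illustrative only: mirrors the { label, score } shape shown above.
interface EvalResult {
  label: string;
  score: number;
}

// Fraction of results flagged as hallucinated (score === 0).
function hallucinationRate(results: EvalResult[]): number {
  if (results.length === 0) return 0;
  const hallucinated = results.filter((r) => r.score === 0).length;
  return hallucinated / results.length;
}

const batch: EvalResult[] = [
  { label: "factual", score: 1 },
  { label: "hallucinated", score: 0 },
  { label: "factual", score: 1 },
  { label: "hallucinated", score: 0 },
];
console.log(hallucinationRate(batch)); // 0.5
```

A summary metric like this is useful for tracking regressions across model or prompt changes, independent of any single example.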
This package works seamlessly with @arizeai/phoenix-client to enable experimentation workflows. You can create datasets, run experiments, and trace evaluation calls for analysis and debugging.
```typescript
import { createHallucinationEvaluator } from "@arizeai/phoenix-evals/llm";
import { openai } from "@ai-sdk/openai";
import { createDataset } from "@arizeai/phoenix-client/datasets";
import { asEvaluator, runExperiment } from "@arizeai/phoenix-client/experiments";

// Create your evaluator
const hallucinationEvaluator = createHallucinationEvaluator({
  model: openai("gpt-4o-mini"),
});

// Create a dataset for your experiment
const dataset = await createDataset({
  name: "hallucination-eval",
  description: "Evaluate the hallucination of the model",
  examples: [
    {
      input: {
        question: "Is Phoenix Open-Source?",
        context: "Phoenix is Open-Source.",
      },
    },
    // ... more examples
  ],
});

// Define your experimental task
const task = async (example) => {
  // Your AI system's response to the question
  return "Phoenix is not Open-Source";
};

// Create a custom evaluator to validate results
const hallucinationCheck = asEvaluator({
  name: "hallucination",
  kind: "LLM",
  evaluate: async ({ input, output }) => {
    // Use the hallucination evaluator from phoenix-evals
    const result = await hallucinationEvaluator({
      input: input.question,
      context: input.context, // Note: uses 'context', not 'reference'
      output: output,
    });
    return result; // Return the evaluation result
  },
});

// Run the experiment with automatic tracing
await runExperiment({
  experimentName: "hallucination-eval",
  experimentDescription: "Evaluate the hallucination of the model",
  dataset,
  task,
  evaluators: [hallucinationCheck],
});
```