This LLM Eval detects whether the output of a model is a hallucination based on contextual data. It is specifically designed to detect hallucinations in answers generated from private or retrieved data: given a question, the generated answer, and the reference data used to produce that answer, it determines whether the answer is a hallucination.
This Eval is designed to check for hallucinations on private data, specifically data that is fed into the context window from retrieval. It is not designed to check for hallucinations about what the LLM was trained on, and it is not useful for detecting hallucinated public facts, e.g. "What was Michael Jordan's birthday?"
In this task, you will be presented with a query, some context and a response. The response is generated to the question based on the context. The response may contain false information. You must use the context to determine if the response to the question contains false information, if the response is a hallucination of facts. Your objective is to determine whether the response text contains factual information and is not a hallucination. A 'hallucination' refers to a response that is not based on the context or assumes information that is not available in the context. Your response should be a single word: either 'factual' or 'hallucinated', and it should not include any other text or characters. 'hallucinated' indicates that the response provides factually inaccurate information to the query based on the context. 'factual' indicates that the response to the question is correct relative to the context, and does not contain made up information. Please read the query and context carefully before determining your response.

[BEGIN DATA]
************
[Query]: {input}
************
[Context]: {context}
************
[Response]: {output}
************
[END DATA]

Is the response above factual or hallucinated based on the query and context?
We are continually iterating on our templates; view the most up-to-date template on GitHub.
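For illustration, here is a minimal sketch of how the template's {input}, {context}, and {output} placeholders are filled before the prompt is sent to the judge LLM. The template string below is abbreviated and the example values are made up; see GitHub for the full, up-to-date template.

# Minimal sketch: fill the template placeholders before sending the prompt
# to the judge LLM. The template is abbreviated here for brevity.
template = (
    "In this task, you will be presented with a query, some context and a response. ...\n"
    "[BEGIN DATA]\n"
    "************\n"
    "[Query]: {input}\n"
    "************\n"
    "[Context]: {context}\n"
    "************\n"
    "[Response]: {output}\n"
    "************\n"
    "[END DATA]\n"
    "Is the response above factual or hallucinated based on the query and context?"
)

prompt = template.format(
    input="Where is the Eiffel Tower located?",
    context="The Eiffel Tower is located in Paris, France.",
    output="The Eiffel Tower is located in Berlin.",  # contradicts the context
)
# The judge LLM is expected to reply with a single word: 'factual' or 'hallucinated'.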
The HallucinationEvaluator requires three inputs: input, output, and context. You can use the .describe() method on any evaluator to learn more about it, including its input_schema, which lists the required inputs.
from phoenix.evals.llm import LLM
from phoenix.evals.metrics import HallucinationEvaluator

# initialize LLM and evaluator
llm = LLM(model="gpt-4o", provider="openai")
hallucination = HallucinationEvaluator(llm=llm)

# use the .describe() method to inspect the input_schema of any evaluator
print(hallucination.describe())
>>> {'name': 'hallucination', 'source': 'llm', 'direction': 'maximize', 'input_schema': {'properties': {'input': {'description': 'The input query.', 'title': 'Input', 'type': 'string'}, 'output': {'description': 'The response to the query.', 'title': 'Output', 'type': 'string'}, 'context': {'description': 'The context or reference text.', 'title': 'Context', 'type': 'string'}}, 'required': ['input', 'output', 'context'], 'title': 'HallucinationInputSchema', 'type': 'object'}}

# let's test on one example
eval_input = {
    "input": "Where is the Eiffel Tower located?",
    "context": "The Eiffel Tower is located in Paris, France. It was constructed in 1889 as the entrance arch to the 1889 World's Fair.",
    "output": "The Eiffel Tower is located in Paris, France.",
}
scores = hallucination.evaluate(eval_input=eval_input)
print(scores[0])
>>> Score(name='hallucination', score=1.0, label='factual', explanation='The response correctly identifies the location of the Eiffel Tower as stated in the context.', metadata={'model': 'gpt-4o'}, source='llm', direction='maximize')
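For comparison, the same evaluate call on a response that contradicts the context should come back with the 'hallucinated' label (sketch only; the exact score and explanation will vary by model run).

# A response that contradicts the context should be labeled 'hallucinated'.
eval_input = {
    "input": "Where is the Eiffel Tower located?",
    "context": "The Eiffel Tower is located in Paris, France. It was constructed in 1889 as the entrance arch to the 1889 World's Fair.",
    "output": "The Eiffel Tower is located in Rome, Italy.",
}
scores = hallucination.evaluate(eval_input=eval_input)
print(scores[0].label)  # expected: 'hallucinated'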
This benchmark was obtained using the notebook below. It was run using the HaluEval QA Dataset as a ground-truth dataset. Each example in the dataset was evaluated using the HALLUCINATION_PROMPT_TEMPLATE above, then the resulting labels were compared against the is_hallucination label in the HaluEval dataset to generate the confusion matrices below, as sketched after this paragraph.
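As a rough sketch of that comparison step (the column names and toy values below are assumptions for illustration, not the notebook's exact code), the predicted labels can be scored against the HaluEval ground truth with scikit-learn:

import pandas as pd
from sklearn.metrics import classification_report, confusion_matrix

# Assumed layout: one row per HaluEval QA example, holding the eval's
# predicted label and the dataset's ground-truth is_hallucination flag.
df = pd.DataFrame({
    "predicted_label": ["hallucinated", "factual", "factual"],
    "is_hallucination": [True, False, True],
})

# Map the boolean ground truth onto the same label space as the eval output.
y_true = df["is_hallucination"].map({True: "hallucinated", False: "factual"})
y_pred = df["predicted_label"]

print(confusion_matrix(y_true, y_pred, labels=["factual", "hallucinated"]))
print(classification_report(y_true, y_pred))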