AI vs Human (Groundtruth)

https://mintcdn.com/arizeai-433a7140/0x3JhHH4Of-bLwrx/images/image-10.png?fit=max&auto=format&n=0x3JhHH4Of-bLwrx&q=85&s=1961b8a14332f1359516bc2a55ec250b

Google Colab

colab.research.google.com

A workflow we see for high quality RAG deployments is generating a golden dataset of questions and a high quality set of answers. These can be in the range of 100-200 but provide a strong check for the AI generated answers. This Eval checks that the human ground truth matches the AI generated answer. Its designed to catch missing data in “half” answers and differences of substance.

Example Human vs AI on Arize Docs:

Question: What Evals are supported for LLMs on generative models? Human: Arize supports a suite of Evals available from the Phoenix Evals library, they include both pre-tested Evals and the ability to configure cusotm Evals. Some of the pre-tested LLM Evals are listed below: Retrieval Relevance, Question and Answer, Toxicity, Human Groundtruth vs AI, Citation Reference Link Relevancy, Code Readability, Code Execution, Hallucination Detection and Summarizaiton AI: Arize supports LLM Evals. Eval: Incorrect Explanation of Eval: The AI answer is very brief and lacks the specific details that are present in the human ground truth answer. While the AI answer is not incorrect in stating that Arize supports LLM Evals, it fails to mention the specific types of Evals that are supported, such as Retrieval Relevance, Question and Answer, Toxicity, Human Groundtruth vs AI, Citation Reference Link Relevancy, Code Readability, Hallucination Detection, and Summarization. Therefore, the AI answer does not fully capture the substance of the human answer. Overview of template:

We are continually iterating our templates, view the most up-to-date template on GitHub.

print(HUMAN_VS_AI_PROMPT_TEMPLATE)

You are comparing a human ground truth answer from an expert to an answer from an AI model.
Your goal is to determine if the AI answer correctly matches, in substance, the human answer.
    [BEGIN DATA]
    ************
    [Question]: {question}
    ************
    [Human Ground Truth Answer]: {correct_answer}
    ************
    [AI Answer]: {ai_generated_answer}
    ************
    [END DATA]
Compare the AI answer to the human ground truth answer, if the AI correctly answers the question,
then the AI answer is "correct". If the AI answer is longer but contains the main idea of the
Human answer please answer "correct". If the AI answer diverges or does not contain the main
idea of the human answer, please answer "incorrect".

How to run the Human vs AI Eval:

from phoenix.evals import (
    HUMAN_VS_AI_PROMPT_RAILS_MAP,
    HUMAN_VS_AI_PROMPT_TEMPLATE,
    OpenAIModel,
    llm_classify,
)

model = OpenAIModel(
    model_name="gpt-4",
    temperature=0.0,
)

# The rails are used to hold the output to specific values based on the template
# It will remove text such as ",,," or "..."
# Will ensure the binary value expected from the template is returned
rails = list(HUMAN_VS_AI_PROMPT_RAILS_MAP.values())
relevance_classifications = llm_classify(
    dataframe=df,
    template=HUMAN_VS_AI_PROMPT_TEMPLATE,
    model=model,
    rails=rails,
    verbose=False,
    provide_explanation=True
)

Benchmark Results:

The follow benchmarking data was gathered by comparing various model results to ground truth data. The ground truth data used was a handcrafted dataset consisting of questions about the Arize platform. That dataset is availabe here. GPT-4 Results

Eval	GPT-4o	GPT-4
Precision	0.90	0.92
Recall	0.56	0.74
F1	0.69	0.82

Tracing

Prompt Engineering

Datasets & Experiments

Evaluation

Settings

Resources

AI vs Human (Groundtruth)

Google Colab

Example Human vs AI on Arize Docs:

How to run the Human vs AI Eval:

Benchmark Results:

Tracing

Prompt Engineering

Datasets & Experiments

Evaluation

Settings

Resources

Google Colab

​Example Human vs AI on Arize Docs:

​How to run the Human vs AI Eval:

​Benchmark Results:

Example Human vs AI on Arize Docs:

How to run the Human vs AI Eval:

Benchmark Results: