Retrieval (RAG) Relevance

When To Use RAG Eval Template

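This eval determines whether a retrieved chunk contains an answer to the query. It is useful for assessing retrieval systems, because it labels each retrieved chunk as either relevant or unrelated to the question that triggered the retrieval.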

RAG Eval Template

You are comparing a reference text to a question and trying to determine if the reference text
contains information relevant to answering the question. Here is the data:
    [BEGIN DATA]
    ************
    [Question]: {query}
    ************
    [Reference text]: {reference}
    [END DATA]

Compare the Question above to the Reference text. You must determine whether the Reference text
contains information that can answer the Question. Please focus on whether the very specific
question can be answered by the information in the Reference text.
Your response must be single word, either "relevant" or "unrelated",
and should not contain any text or characters aside from that word.
"unrelated" means that the reference text does not contain an answer to the Question.
"relevant" means the reference text contains an answer to the Question.

We are continually iterating on our templates. View the most up-to-date template on GitHub.
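To make the two placeholders concrete, here is a minimal sketch that fills a shortened copy of the template with Python's built-in `str.format`. The `TEMPLATE` constant, the `render_prompt` helper, and the sample question and chunk are all illustrative assumptions, not part of the library; the sketch is only meant to show roughly what the evaluating model sees for one row.

```python
# Shortened copy of the template above: only the data section is reproduced.
TEMPLATE = (
    "[BEGIN DATA]\n"
    "************\n"
    "[Question]: {query}\n"
    "************\n"
    "[Reference text]: {reference}\n"
    "[END DATA]\n"
)


def render_prompt(query: str, reference: str) -> str:
    """Substitute one query/chunk pair into the template placeholders."""
    return TEMPLATE.format(query=query, reference=reference)


# Hypothetical example pair, not taken from any benchmark dataset.
prompt = render_prompt(
    query="What year was the Eiffel Tower completed?",
    reference="The Eiffel Tower was completed in 1889 for the World's Fair.",
)
print(prompt)
```

Each retrieved chunk is judged on its own, so a query with several retrieved chunks produces one filled prompt, and one label, per chunk.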

How To Run the RAG Relevance Eval

from phoenix.evals import (
    RAG_RELEVANCY_PROMPT_RAILS_MAP,
    RAG_RELEVANCY_PROMPT_TEMPLATE,
    OpenAIModel,
    llm_classify,
)

model = OpenAIModel(
    model_name="gpt-4",
    temperature=0.0,
)

# df is a pandas DataFrame that you supply, with one row per query/chunk pair
# and columns named after the template's variables.

# The rails constrain the output to the values the template allows.
# Stray characters around the label, such as ",,," or "...", are stripped,
# so the eval returns the binary value the template expects.
rails = list(RAG_RELEVANCY_PROMPT_RAILS_MAP.values())
relevance_classifications = llm_classify(
    dataframe=df,
    template=RAG_RELEVANCY_PROMPT_TEMPLATE,
    model=model,
    rails=rails,
    provide_explanation=True,  # optional: also generate an explanation for each label
)

The above runs the RAG relevancy LLM template against the dataframe df.
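The idea behind the rails can be illustrated with a simplified sketch. This is not Phoenix's implementation: the `snap_label` function and the `UNPARSABLE` sentinel are hypothetical names, and the matching rule (exactly one rail must appear as a whole word) is an assumption chosen to keep the example short.

```python
import re
from typing import List

UNPARSABLE = "unparsable"  # hypothetical sentinel for outputs that match no single rail


def snap_label(raw_output: str, rails: List[str]) -> str:
    """Map raw model output to one allowed label, ignoring stray punctuation."""
    cleaned = raw_output.strip().lower()
    # Whole-word matching keeps "irrelevant" from being read as "relevant".
    matches = [
        rail for rail in rails
        if re.search(rf"\b{re.escape(rail)}\b", cleaned)
    ]
    # Accept the output only when exactly one rail is present.
    return matches[0] if len(matches) == 1 else UNPARSABLE


rails = ["relevant", "unrelated"]
print(snap_label("relevant...", rails))
print(snap_label(" Unrelated,,,", rails))
print(snap_label("relevant or unrelated", rails))
```

Rejecting ambiguous outputs rather than guessing keeps a malformed response from being counted as either label.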

Benchmark Results

This benchmark was obtained using the notebook below, with the WikiQA dataset as the ground truth. Each example in the dataset was evaluated using the RAG_RELEVANCY_PROMPT_TEMPLATE above, and the resulting labels were then compared against the ground-truth labels in WikiQA to generate the confusion matrices below.
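The comparison step can be sketched in plain Python. This is an illustration rather than the notebook's code: the `relevance_metrics` helper is a hypothetical name, treating "relevant" as the positive class is an assumption, and the two short label lists are made-up values used only to show the arithmetic.

```python
from typing import Dict, List


def relevance_metrics(
    truth: List[str], predicted: List[str], positive: str = "relevant"
) -> Dict[str, float]:
    """Count confusion-matrix cells and derive precision, recall, and F1."""
    pairs = list(zip(truth, predicted))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    # Guard against empty denominators so the function never divides by zero.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return {
        "tp": tp, "fp": fp, "fn": fn, "tn": tn,
        "precision": precision, "recall": recall, "f1": f1,
    }


# Made-up labels for illustration only; these are not WikiQA results.
truth = ["relevant", "relevant", "unrelated", "unrelated", "relevant"]
predicted = ["relevant", "unrelated", "relevant", "unrelated", "relevant"]
metrics = relevance_metrics(truth, predicted)
print(metrics)
```

Here `truth` plays the role of the ground-truth labels and `predicted` the labels returned by the eval; the four counts are the cells of a two-class confusion matrix.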

Google Colab

colab.research.google.com

GPT-4 Result

[Figure: confusion matrix for GPT-4]

RAG Eval	GPT-4o	GPT-4
Precision	0.60	0.70
Recall	0.77	0.88
F1	0.67	0.78

Throughput	GPT-4
100 Samples	113 Sec
