Reference (Citation) Link

Reference Links in Retrieval Q&A

In chatbots and Q&A systems, responses often include a reference link alongside the answer, pointing users to the documentation or page that is the source of the answer or that contains more information.

EXAMPLE: Q&A from Arize-Phoenix Documentation

QUESTION: What other models does Arize Phoenix support beyond OpenAI for running Evals?

ANSWER: Phoenix does support a large set of LLM models through the model object. Phoenix supports OpenAI (GPT-4, GPT-4-32k, GPT-3.5 Turbo, GPT-3.5 Instruct, etc…), Azure OpenAI, Google Palm2 Text Bison, and All AWS Bedrock models (Claude, Mistral, etc…).

REFERENCE LINK: https://arize.com/docs/phoenix/sdk-api-reference/python/arize-phoenix-evals

This Eval checks whether the reference link returned answers the question asked in the conversation.

We are continually iterating on our templates; view the most up-to-date template on GitHub.

print(REF_LINK_EVAL_PROMPT_TEMPLATE_STR)

You are given a conversation that contains questions by a CUSTOMER and you are trying
to determine if the documentation page shared by the ASSISTANT correctly answers
the CUSTOMERS questions. We will give you the conversation between the customer
and the ASSISTANT and the text of the documentation returned:
    [CONVERSATION AND QUESTION]:
    {conversation}
    ************
    [DOCUMENTATION URL TEXT]:
    {document_text}
    [DOCUMENTATION URL TEXT]:
You should respond "correct" if the documentation text answers the question the
CUSTOMER had in the conversation. If the documentation roughly answers the question
even in a general way the please answer "correct". If there are multiple questions and a single
question is answered, please still answer "correct". If the text does not answer the
question in the conversation, or doesn't contain information that would allow you
to answer the specific question please answer "incorrect".
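The template above has two placeholders, {conversation} and {document_text}, so each example passed to the eval needs a value for both. The following is a minimal sketch of that requirement using plain Python string formatting. The shortened STAND_IN_TEMPLATE, the fill_template helper, and the sample page text are all hypothetical and only illustrate the shape of the inputs; this is not how Phoenix fills the template internally.

```python
# Shortened stand-in for the template above; only the two placeholders matter here.
STAND_IN_TEMPLATE = (
    "[CONVERSATION AND QUESTION]:\n{conversation}\n"
    "************\n"
    "[DOCUMENTATION URL TEXT]:\n{document_text}\n"
)

# One row per (conversation, page text) pair. The page text is invented for
# illustration; in practice it would be the text behind the reference link.
rows = [
    {
        "conversation": "CUSTOMER: What other models does Arize Phoenix support "
        "beyond OpenAI for running Evals?",
        "document_text": "Phoenix evals can be run with several model providers ...",
    },
]

REQUIRED_FIELDS = ("conversation", "document_text")


def fill_template(row):
    """Fill the stand-in template, failing loudly if a field is missing."""
    missing = [name for name in REQUIRED_FIELDS if name not in row]
    if missing:
        raise KeyError(f"row is missing template fields: {missing}")
    return STAND_IN_TEMPLATE.format(**row)


prompts = [fill_template(row) for row in rows]
print(prompts[0])
```

Checking for missing fields before formatting gives a clearer error than a bare formatting failure when a column has been misnamed.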

How to run the Citation Eval

from phoenix.evals import (
    REF_LINK_EVAL_PROMPT_RAILS_MAP,
    REF_LINK_EVAL_PROMPT_TEMPLATE_STR,
    OpenAIModel,
    download_benchmark_dataset,
    llm_classify,
)

model = OpenAIModel(
    model_name="gpt-4",
    temperature=0.0,
)

# The rails constrain the eval output to the specific values defined by the template.
# They strip stray text such as ",,," or "..." and ensure that the binary value
# expected from the template is returned.
rails = list(REF_LINK_EVAL_PROMPT_RAILS_MAP.values())

# df must contain the `conversation` and `document_text` columns referenced by the template.
relevance_classifications = llm_classify(
    dataframe=df,
    template=REF_LINK_EVAL_PROMPT_TEMPLATE_STR,
    model=model,
    rails=rails,
    provide_explanation=True,  # optional: generate an explanation for each value produced by the eval LLM
)
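To make the idea of rails concrete, here is a small sketch of snapping free-form model output onto a fixed set of labels. The snap_to_rails function is hypothetical and is not Phoenix's implementation; it assumes the two labels named in the template, "correct" and "incorrect", and returns None when the output cannot be mapped to exactly one of them.

```python
import re


def snap_to_rails(raw_output, rails):
    """Map free-form LLM output onto one allowed label, or None if ambiguous."""
    text = raw_output.strip().lower()
    # Whole-word matching keeps "incorrect" from also counting as "correct".
    hits = [
        rail
        for rail in rails
        if re.search(rf"\b{re.escape(rail.lower())}\b", text)
    ]
    return hits[0] if len(hits) == 1 else None


rails = ["correct", "incorrect"]

print(snap_to_rails('"correct"...', rails))
print(snap_to_rails("Incorrect,,,", rails))
print(snap_to_rails("maybe", rails))
```

The whole-word match matters here because one label is a substring of the other; a plain substring check would treat every "incorrect" response as containing "correct".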

Benchmark Results

This benchmark was obtained using the notebook below. It was run on a handcrafted ground truth dataset consisting of questions about the Arize platform. That dataset is available here. Each example in the dataset was evaluated using the REF_LINK_EVAL_PROMPT_TEMPLATE_STR above, then the resulting labels were compared against the ground truth labels in the benchmark dataset to generate the confusion matrices below.

[Image: confusion matrix for the Reference Link Eval benchmark]

Google Colab

colab.research.google.com

GPT-4 Results

Reference Link Evals	GPT-4o
Precision	0.96
Recall	0.79
F1	0.87
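The metrics in the table follow from comparing eval labels with ground truth labels. The sketch below shows one way to compute them, assuming "correct" is treated as the positive class (an assumption; the benchmark notebook may define it differently). The toy labels are invented for illustration and are not the benchmark data. The last line checks that the reported F1 is consistent with the reported precision and recall.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def binary_metrics(truth, predicted, positive="correct"):
    """Precision, recall and F1 for one positive label."""
    pairs = list(zip(truth, predicted))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall, f1_from(precision, recall)


# Toy labels, invented for illustration only.
truth = ["correct", "correct", "correct", "correct", "incorrect", "incorrect"]
predicted = ["correct", "correct", "correct", "incorrect", "incorrect", "correct"]

print(binary_metrics(truth, predicted))  # (0.75, 0.75, 0.75)

# Consistency check against the table: precision 0.96, recall 0.79.
print(round(f1_from(0.96, 0.79), 2))  # 0.87
```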
