This tutorial shows how to classify documents as relevant or irrelevant to queries using benchmark datasets with ground-truth labels.

Key Points:
  • Download and prepare benchmark datasets for relevance classification
  • Compare LLMs (GPT-4, GPT-3.5, GPT-4 Turbo) on classification accuracy
  • Analyze results with confusion matrices and detailed reports
  • Get explanations for LLM classifications to understand decision-making
  • Measure retrieval quality using ranking metrics like precision@k (see the sketch after this list)
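
Precision@k is not computed in the snippets below, but for reference, here is a minimal sketch of the metric: the fraction of the top-k retrieved documents that are relevant. The helper function and its inputs are illustrative, not part of the Phoenix API.

# Hypothetical helper, not part of phoenix.evals: precision@k is the
# fraction of the top-k retrieved documents that are relevant.
def precision_at_k(relevance_labels: list, k: int) -> float:
    """relevance_labels is ordered by retrieval rank (best first)."""
    top_k = relevance_labels[:k]
    return sum(top_k) / k

# Example: the first and third of the top-3 documents are relevant.
print(precision_at_k([True, False, True, False], k=3))  # 0.666...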

Notebook Walkthrough

We will go through key code snippets on this page. To follow the full tutorial, check out the full notebook on Google Colab (colab.research.google.com).

Download Benchmark Dataset

from phoenix.evals import download_benchmark_dataset

# Load a binary relevance benchmark with ground-truth labels.
df = download_benchmark_dataset(
    task="binary-relevance-classification",
    dataset_name="wiki_qa-train"
)
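
The downloaded frame contains, among other columns, the query text, the document text, and a boolean ground-truth label; the three columns below are the ones the rest of this walkthrough relies on.

# Peek at the columns used in the rest of the walkthrough.
print(df[["query_text", "document_text", "relevant"]].head())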

Configure Evaluation

# Evaluate on a random sample of 100 rows.
N_EVAL_SAMPLE_SIZE = 100
df_sample = df.sample(n=N_EVAL_SAMPLE_SIZE).reset_index(drop=True)

# Rename columns to match the {input} and {reference} variables
# expected by the relevance prompt template.
df_sample = df_sample.rename(columns={
    "query_text": "input",
    "document_text": "reference",
})
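
Before running the evaluation, it can be worth checking the class balance of the sample, since a heavily skewed sample makes headline accuracy hard to interpret:

# Distribution of ground-truth labels in the sampled rows.
print(df_sample["relevant"].value_counts())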

Run Relevance Classification

from phoenix.evals import (
    OpenAIModel,
    RAG_RELEVANCY_PROMPT_RAILS_MAP,
    RAG_RELEVANCY_PROMPT_TEMPLATE,
    llm_classify,
)

# Use a deterministic model as the LLM judge.
model = OpenAIModel(model="gpt-4", temperature=0.0)

# The rails constrain the model's output to the allowed label strings.
rails = list(RAG_RELEVANCY_PROMPT_RAILS_MAP.values())

relevance_classifications = llm_classify(
    dataframe=df_sample,
    template=RAG_RELEVANCY_PROMPT_TEMPLATE,
    model=model,
    rails=rails,
    concurrency=20,
)["label"].tolist()

Evaluate Results

import matplotlib.pyplot as plt
from pycm import ConfusionMatrix
from sklearn.metrics import classification_report

# Map the boolean ground-truth labels onto the same rail strings
# the model was constrained to output.
true_labels = df_sample["relevant"].map(RAG_RELEVANCY_PROMPT_RAILS_MAP).tolist()

# Per-class precision, recall, and F1.
print(classification_report(true_labels, relevance_classifications, labels=rails))

# Confusion matrix of ground truth vs. model predictions.
confusion_matrix = ConfusionMatrix(
    actual_vector=true_labels, predict_vector=relevance_classifications, classes=rails
)
confusion_matrix.plot(
    cmap=plt.colormaps["Blues"],
    number_label=True,
    normalized=True,
)
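
For a single headline number alongside the per-class report:

from sklearn.metrics import accuracy_score

# Overall fraction of correctly classified documents.
print(f"accuracy: {accuracy_score(true_labels, relevance_classifications):.3f}")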

Get Explanations

# Request a free-text explanation alongside each label; a small
# sample keeps the extra token cost down.
relevance_classifications_df = llm_classify(
    dataframe=df_sample.sample(n=5),
    template=RAG_RELEVANCY_PROMPT_TEMPLATE,
    model=model,
    rails=rails,
    provide_explanation=True,
    concurrency=20,
)
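
With provide_explanation=True, the returned dataframe carries the model's reasoning in an explanation column alongside each label:

# Inspect each prediction together with the model's stated reasoning.
print(relevance_classifications_df[["label", "explanation"]])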

Compare Models

Run the same evaluation with different models:
# GPT-3.5
model_gpt35 = OpenAIModel(model="gpt-3.5-turbo", temperature=0.0)

# GPT-4 Turbo
model_gpt4turbo = OpenAIModel(model="gpt-4-turbo-preview", temperature=0.0)
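
A minimal sketch of the comparison loop, reusing df_sample, rails, and true_labels from above; accuracy is used here as a simple summary metric, but the classification report and confusion matrix apply equally to each model:

from sklearn.metrics import accuracy_score

for name, candidate_model in [
    ("gpt-3.5-turbo", model_gpt35),
    ("gpt-4-turbo", model_gpt4turbo),
]:
    # Re-run the same classification with a different judge model.
    predictions = llm_classify(
        dataframe=df_sample,
        template=RAG_RELEVANCY_PROMPT_TEMPLATE,
        model=candidate_model,
        rails=rails,
        concurrency=20,
    )["label"].tolist()
    print(f"{name}: accuracy = {accuracy_score(true_labels, predictions):.3f}")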