When To Use the Summarization Eval Template
This Eval helps evaluate the results of a summarization task. The template variables are:

- document: the document text to summarize
- summary: the summary of the document
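As a minimal sketch of how these two variables are substituted into a prompt, assuming a Python format-style template (the placeholder names match the variables above, but the template text here is purely illustrative, not the real prompt):

```python
# Illustrative stand-in for the real summarization eval prompt; only the
# variable names (document, summary) come from this page.
TEMPLATE = (
    "You are comparing a summary to its source document.\n"
    "[BEGIN DOCUMENT]\n{document}\n[END DOCUMENT]\n"
    "[BEGIN SUMMARY]\n{summary}\n[END SUMMARY]\n"
    "Answer 'good' or 'bad'."
)

def render_prompt(document: str, summary: str) -> str:
    """Substitute the two template variables into the prompt text."""
    return TEMPLATE.format(document=document, summary=summary)

prompt = render_prompt("Long article text ...", "Short summary ...")
```

The rendered prompt is what gets sent to the judge model for each (document, summary) pair.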
Summarization Eval Template
We are continually iterating on our templates; view the most up-to-date template on GitHub.
How To Run the Summarization Eval
Benchmark Results
This benchmark was obtained using the notebook below. It was run using a CNN Daily Mail summarization dataset as a ground truth dataset. Each example in the dataset was evaluated using the SUMMARIZATION_PROMPT_TEMPLATE above, then the resulting labels were compared against the ground truth labels in the summarization dataset to generate the confusion matrices below.
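The comparison step above can be sketched as follows. This is a generic tally of eval labels against ground-truth labels, assuming binary "good"/"bad" labels for illustration (the actual label rails come from the template in the library):

```python
from collections import Counter

def confusion_counts(predicted, truth, positive="good"):
    """Tally TP/FP/FN/TN by comparing eval labels to ground-truth labels."""
    counts = Counter(tp=0, fp=0, fn=0, tn=0)
    for p, t in zip(predicted, truth):
        if p == positive and t == positive:
            counts["tp"] += 1
        elif p == positive:
            counts["fp"] += 1  # eval said good, ground truth said bad
        elif t == positive:
            counts["fn"] += 1  # eval said bad, ground truth said good
        else:
            counts["tn"] += 1
    return counts

counts = confusion_counts(
    predicted=["good", "good", "bad", "bad"],
    truth=["good", "bad", "good", "bad"],
)
```

Precision and recall then follow directly from these four counts.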
Try it out!
GPT-4o and GPT-4 Results

| Eval | GPT-4o | GPT-4 |
|---|---|---|
| Precision | 0.87 | 0.79 |
| Recall | 0.63 | 0.88 |
| F1 | 0.73 | 0.83 |
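The F1 rows in the table are the harmonic mean of the precision and recall rows, which can be checked directly:

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

gpt4o_f1 = round(f1(0.87, 0.63), 2)  # 0.73, matching the table
gpt4_f1 = round(f1(0.79, 0.88), 2)   # 0.83, matching the table
```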


