All eval templates are tested against golden data shipped as part of the LLM eval library's benchmarked datasets, targeting 70-90% precision and 70-85% F1.
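For reference, those precision and F1 numbers come from comparing each template's output labels against the golden labels. The sketch below shows the computation with scikit-learn; the column names and the positive class are hypothetical stand-ins for whatever a given golden dataset uses.

```python
# Minimal sketch of how the precision/F1 benchmarks are computed.
# "golden_label", "eval_label", and the positive class are hypothetical;
# substitute the columns of the actual golden dataset.
import pandas as pd
from sklearn.metrics import f1_score, precision_score

golden_df = pd.DataFrame(
    {
        "golden_label": ["hallucinated", "factual", "factual", "hallucinated"],
        "eval_label": ["hallucinated", "factual", "hallucinated", "hallucinated"],
    }
)

precision = precision_score(
    golden_df["golden_label"], golden_df["eval_label"], pos_label="hallucinated"
)
f1 = f1_score(
    golden_df["golden_label"], golden_df["eval_label"], pos_label="hallucinated"
)
print(f"precision={precision:.2f}  f1={f1:.2f}")  # targets: 0.70-0.90 / 0.70-0.85
```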

Hallucination Eval

Detects hallucinations in answers over public and private data
Tested on: Hallucination QA Dataset, Hallucination RAG Dataset
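As a usage illustration, the snippet below assumes the library in question is Arize Phoenix's `phoenix.evals` module; the page does not name it, so treat the imports and model wrapper as assumptions to adapt to your own library. It runs the hallucination template over a small dataframe.

```python
# Sketch, assuming the library is Arize Phoenix's `phoenix.evals`
# (an assumption; adapt imports and the model wrapper to your library).
import pandas as pd
from phoenix.evals import (
    HALLUCINATION_PROMPT_RAILS_MAP,
    HALLUCINATION_PROMPT_TEMPLATE,
    OpenAIModel,
    llm_classify,
)

# The template expects `input` (query), `reference` (retrieved context),
# and `output` (model answer) columns.
df = pd.DataFrame(
    {
        "input": ["Who wrote Hamlet?"],
        "reference": ["Hamlet is a tragedy written by William Shakespeare."],
        "output": ["Hamlet was written by Christopher Marlowe."],
    }
)

results = llm_classify(
    dataframe=df,
    template=HALLUCINATION_PROMPT_TEMPLATE,
    model=OpenAIModel(model="gpt-4o"),
    rails=list(HALLUCINATION_PROMPT_RAILS_MAP.values()),  # "hallucinated" / "factual"
)
print(results["label"])
```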

Heuristic Metrics

List of Heuristics
Tested on: Heuristic Metrics

Q&A Eval

Q&A on private data
Tested on: WikiQA

Retrieval Eval

Relevance of each individually retrieved document in a RAG pipeline
Tested on: MS MARCO, WikiQA
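Under the same Phoenix assumption as above, the Q&A and retrieval templates can also be run through the evaluator interface, which batches several evals over one dataframe in a single call.

```python
# Sketch, again assuming Phoenix's `phoenix.evals`; QAEvaluator grades
# answer correctness, RelevanceEvaluator grades each retrieved document.
import pandas as pd
from phoenix.evals import OpenAIModel, QAEvaluator, RelevanceEvaluator, run_evals

df = pd.DataFrame(
    {
        "input": ["When was MS MARCO released?"],
        "reference": [
            "MS MARCO is a machine reading comprehension dataset "
            "released by Microsoft in 2016."
        ],
        "output": ["MS MARCO was released in 2016."],
    }
)

model = OpenAIModel(model="gpt-4o")
qa_results, relevance_results = run_evals(
    dataframe=df,
    evaluators=[QAEvaluator(model), RelevanceEvaluator(model)],
    provide_explanation=True,  # adds a rationale column next to each label
)
print(qa_results["label"], relevance_results["label"])
```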

Summarization Eval

Summarization performance
Tested on: GigaWord, CNN/DailyMail, XSum

Code Generation Eval

Correctness and readability of generated code
Tested on: WikiSQL, HumanEval, CodeXGLUE

Toxicity Eval

Reference Link

User Frustration

Agent Function Calling