All eval templates are tested against golden datasets that are available as part of the LLM eval library's benchmarked data, and they target precision of 70-90% and F1 of 70-85%.
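For reference, those precision and F1 targets come from comparing a template's predicted labels against the golden labels in the benchmark data. A minimal sketch of that comparison, assuming the golden data is already loaded as parallel label lists (the values and the positive class below are illustrative, not the library's API):

```python
from sklearn.metrics import precision_score, f1_score

# Golden labels from a benchmark dataset and labels predicted by an eval
# template (illustrative values; "hallucinated" is treated as the positive class).
golden = ["hallucinated", "factual", "hallucinated", "factual", "hallucinated"]
predicted = ["hallucinated", "factual", "hallucinated", "factual", "factual"]

precision = precision_score(golden, predicted, pos_label="hallucinated")
f1 = f1_score(golden, predicted, pos_label="hallucinated")

print(f"precision={precision:.2f}  f1={f1:.2f}")
# Templates are expected to land in the 70-90% precision / 70-85% F1 range
# on their benchmark datasets.
```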

1. Hallucination Eval

Hallucinations on answers to public and private data

Tested on: Hallucination QA Dataset, Hallucination RAG Dataset
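To illustrate how an eval of this kind typically works: an LLM judge is given the question, the reference context, and the model's answer, and returns a binary label. This is a generic LLM-as-judge sketch, not the library's actual template; `call_llm` is a hypothetical helper for whatever model client you use.

```python
HALLUCINATION_PROMPT = """You are checking whether an answer is grounded in the
reference text. Reply with exactly one word: "factual" or "hallucinated".

[Reference]: {reference}
[Question]: {query}
[Answer]: {answer}
"""

def call_llm(prompt: str) -> str:
    """Hypothetical helper: send `prompt` to your judge model and return its reply."""
    raise NotImplementedError("wire this to your LLM client")

def hallucination_label(query: str, reference: str, answer: str) -> str:
    """Classify one (query, reference, answer) triple as factual or hallucinated."""
    reply = call_llm(
        HALLUCINATION_PROMPT.format(reference=reference, query=query, answer=answer)
    )
    label = reply.strip().lower()
    # Fall back to the more conservative label if the judge replies unexpectedly.
    return label if label in {"factual", "hallucinated"} else "hallucinated"
```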

2. Code Metrics

3. Q&A Eval

Private data Q&A Eval

Tested on: WikiQA

4. Retrieval Eval

RAG individual retrieval

Tested on: MS MARCO, WikiQA
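For context, an individual-retrieval eval labels each retrieved chunk as relevant or irrelevant to the query and then aggregates those judgments into ranking metrics such as precision@k or MRR. A rough sketch of the aggregation step, assuming the per-chunk judging happens upstream (all names here are illustrative):

```python
from typing import Sequence

def precision_at_k(relevance: Sequence[bool], k: int) -> float:
    """Fraction of the top-k retrieved chunks judged relevant."""
    top_k = list(relevance)[:k]
    return sum(top_k) / len(top_k) if top_k else 0.0

def mean_reciprocal_rank(relevance: Sequence[bool]) -> float:
    """1 / rank of the first relevant chunk, or 0.0 if none are relevant."""
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            return 1.0 / rank
    return 0.0

# Per-chunk relevance judgments for one query, in retrieval order
# (e.g. produced by an LLM judge over MS MARCO or WikiQA passages).
judgments = [False, True, True, False]
print(precision_at_k(judgments, k=2))   # 0.5
print(mean_reciprocal_rank(judgments))  # 0.5
```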

5. Summarization Eval

Summarization performance

Tested on: GigaWord, CNNDM, XSum

6. Code Generation Eval

Code writing correctness and readability

Tested on: WikiSQL, HumanEval, CodeXGLUE
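Correctness for code generation is usually measured functionally, by executing the generated code against the benchmark's test cases (HumanEval-style); readability is typically graded separately. A minimal, illustrative correctness check follows; unsandboxed `exec` is used only for brevity, and real harnesses isolate the execution.

```python
def passes_tests(generated_code: str, test_code: str) -> bool:
    """Run the generated code plus the benchmark's asserts; True if nothing raises."""
    namespace: dict = {}
    try:
        exec(generated_code, namespace)  # define the candidate function
        exec(test_code, namespace)       # run the dataset's assert-based tests
        return True
    except Exception:
        return False

# Illustrative HumanEval-style sample: a candidate solution and its checks.
candidate = "def add(a, b):\n    return a + b\n"
tests = "assert add(2, 3) == 5\nassert add(-1, 1) == 0\n"
print(passes_tests(candidate, tests))  # True
```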

7. Toxicity Eval

8. Reference Link

9. User Frustration

10. Agent Function Calling