- Understanding Retrieval Augmented Generation (RAG).
- Building RAG (with the help of a framework such as LlamaIndex).
- Evaluating RAG with Phoenix Evals.
Retrieval Augmented Generation (RAG)
LLMs are trained on vast amounts of data, but that data will not include your specific data (things like company knowledge bases and documentation). Retrieval-Augmented Generation (RAG) addresses this by dynamically incorporating your data as context during the generation process. This is done not by altering the training data of the LLM but by allowing the model to access and utilize your data at query time to provide more tailored and contextually relevant responses.

In RAG, your data is loaded and prepared for queries; this process is called indexing. User queries act on the index, which filters your data down to the most relevant context. This context and your query are then sent to the LLM along with a prompt, and the LLM provides a response. RAG is a critical component for building applications such as chatbots or agents, and you will want to understand RAG techniques for getting data into your application.
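The index → retrieve → prompt → LLM flow described above can be sketched end to end. This is a toy illustration, not the LlamaIndex API: the word-overlap retriever and the stubbed `llm` callable are stand-ins so the example runs without any external service.

```python
def retrieve(query: str, documents: list[str], k: int = 1) -> list[str]:
    """Toy retriever: rank documents by how many words they share with the query."""
    q = set(query.lower().split())
    return sorted(documents, key=lambda d: len(q & set(d.lower().split())), reverse=True)[:k]

def build_prompt(query: str, context: list[str]) -> str:
    """Combine the retrieved context and the user query into a single prompt."""
    return "Answer using only this context:\n" + "\n".join(context) + "\n\nQuestion: " + query

def rag_answer(query: str, documents: list[str], llm=lambda prompt: "(LLM response)") -> str:
    # In a real pipeline, llm would be a call to a hosted model.
    return llm(build_prompt(query, retrieve(query, documents)))

docs = [
    "The company handbook says support tickets are triaged daily.",
    "Quarterly planning happens in January.",
]
print(rag_answer("How often are support tickets triaged?", docs))
```

A real system replaces each stand-in with a production component, but the shape of the data flow stays the same.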
Stages within RAG
There are five key stages within RAG, which will in turn be a part of any larger RAG application.
- Loading: This refers to getting your data from where it lives - whether it’s text files, PDFs, another website, a database, or an API - into your pipeline.
- Indexing: This means creating a data structure that allows for querying the data. For LLMs this nearly always means creating vector embeddings, numerical representations of the meaning of your data, as well as numerous other metadata strategies to make it easy to accurately find contextually relevant data.
- Storing: Once your data is indexed, you will want to store your index, along with any other metadata, to avoid the need to re-index it.
- Querying: For any given indexing strategy there are many ways you can utilize LLMs and data structures to query, including sub-queries, multi-step queries, and hybrid strategies.
- Evaluation: A critical step in any pipeline is checking how effective it is relative to other strategies, or when you make changes. Evaluation provides objective measures on how accurate, faithful, and fast your responses to queries are.
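The indexing, storing, and querying stages above can be sketched with a tiny in-memory vector index. Note the hand-rolled `embed()` is a character-frequency stand-in for a real embedding model, used only so the example runs without one; in practice an embedding API produces these vectors.

```python
import math

def embed(text: str) -> list[float]:
    # Toy "embedding": a normalized character-frequency vector over a-z.
    counts = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            counts[ord(ch) - ord("a")] += 1.0
    norm = math.sqrt(sum(c * c for c in counts)) or 1.0
    return [c / norm for c in counts]

def cosine(a: list[float], b: list[float]) -> float:
    # Vectors are unit-normalized, so the dot product is the cosine similarity.
    return sum(x * y for x, y in zip(a, b))

class VectorIndex:
    def __init__(self):
        self.entries = []  # (embedding, text) pairs -- the "storing" stage

    def add(self, text: str):
        self.entries.append((embed(text), text))  # the "indexing" stage

    def query(self, question: str, top_k: int = 1) -> list[str]:
        # The "querying" stage: rank stored chunks by similarity to the question.
        q = embed(question)
        ranked = sorted(self.entries, key=lambda e: cosine(q, e[0]), reverse=True)
        return [text for _, text in ranked[:top_k]]

index = VectorIndex()
index.add("Lisp is a programming language.")
index.add("Painting still lifes at RISD.")
print(index.query("programming language"))
```

Frameworks like LlamaIndex wrap exactly this pattern, swapping in real embedding models, persistent vector stores, and richer metadata.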
Build a RAG system
Now that we have understood the stages of RAG, let’s build a pipeline. We will use LlamaIndex for RAG and Phoenix Evals for evaluation.
Load Data and Build an Index
The number of retrieved chunks is controlled by the similarity_top_k parameter when creating the query engine: vector_index.as_query_engine(similarity_top_k=k).
Let’s check the text in each of these retrieved nodes.
| context.span_id | name | span_kind | attributes.input.value | attributes.retrieval.documents |
|---|---|---|---|---|
| 6aba9eee-91c9-4ee2-81e9-1bdae2eb435d | llm | LLM | NaN | NaN |
| cc9feb6a-30ba-4f32-af8d-8c62dd1b1b23 | synthesize | CHAIN | What did the author do growing up? | NaN |
| 8202dbe5-d17e-4939-abd8-153cad08bdca | embedding | EMBEDDING | NaN | NaN |
| aeadad73-485f-400b-bd9d-842abfaa460b | retrieve | RETRIEVER | What did the author do growing up? | [{‘document.content’: ‘What I Worked OnFebru… |
| 9e25c528-5e2f-4719-899a-8248bab290ec | query | CHAIN | What did the author do growing up? | NaN |
| context.span_id | attributes.input.value | attributes.retrieval.documents |
|---|---|---|
| aeadad73-485f-400b-bd9d-842abfaa460b | What did the author do growing up? | [{‘document.content’: ‘What I Worked OnFebru… |
Evaluation
Evaluation should serve as the primary metric for assessing your RAG application. It determines whether the pipeline will produce accurate responses based on the data sources and range of queries. While it’s beneficial to examine individual queries and responses, this approach becomes impractical as the volume of edge cases and failures increases. Instead, it’s more effective to establish a suite of metrics and automated evaluations. These tools can provide insights into overall system performance and can identify specific areas that may require scrutiny. In a RAG system, evaluation focuses on two critical aspects:
- Retrieval Evaluation: assess the accuracy and relevance of the documents that were retrieved.
- Response Evaluation: measure the appropriateness of the response generated by the system given the retrieved context.
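As a sketch of what retrieval evaluation computes: treat a query as a "hit" when the document it was generated from appears in the retrieved list, and aggregate hit rate and mean reciprocal rank (MRR) over all queries. The data here is invented for illustration; Phoenix computes comparable metrics from real traces.

```python
def hit_rate_and_mrr(results: list[tuple[str, list[str]]]) -> tuple[float, float]:
    """results: (expected_doc_id, ranked_retrieved_doc_ids) per query."""
    hits, reciprocal_ranks = 0, 0.0
    for expected, retrieved in results:
        if expected in retrieved:
            hits += 1
            # Reciprocal rank rewards placing the correct document earlier.
            reciprocal_ranks += 1.0 / (retrieved.index(expected) + 1)
    n = len(results)
    return hits / n, reciprocal_ranks / n

results = [
    ("doc-1", ["doc-1", "doc-7"]),  # correct doc ranked first
    ("doc-2", ["doc-5", "doc-2"]),  # correct doc ranked second
    ("doc-3", ["doc-4", "doc-9"]),  # miss
]
print(hit_rate_and_mrr(results))  # (0.666..., 0.5)
```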
Generate Question Context Pairs
For the evaluation of a RAG system, it’s essential to have queries that can fetch the correct context and subsequently generate an appropriate response. For this tutorial, let’s use Phoenix’s llm_generate to help us create the question-context pairs.
First, let’s create a dataframe of all the document chunks that we have indexed.
| | text |
|---|---|
| 0 | What I Worked On\n\nFebruary 2021\n\nBefore co… |
| 1 | I was puzzled by the 1401. I couldn’t figure o… |
| 2 | I remember vividly how impressed and envious I… |
| 3 | I couldn’t have put this into words when I was… |
| 4 | This was more like it; this was what I had exp… |
| | question_1 | question_2 | question_3 |
|---|---|---|---|
| 0 | What were the two main things the author worke… | What was the language the author used to write… | What was the author’s clearest memory regardin… |
| 1 | What were the limitations of the 1401 computer… | How did microcomputers change the author’s exp… | Why did the author’s father buy a TRS-80 compu… |
| 2 | What was the author’s first experience with co… | Why did the author decide to switch from study… | What were the two things that influenced the a… |
| 3 | What were the two things that inspired the aut… | What programming language did the author learn… | What was the author’s undergraduate thesis about? |
| 4 | What was the author’s undergraduate thesis about? | Which three grad schools did the author apply to? | What realization did the author have during th… |
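The wide table above (three generated questions per chunk) can be reshaped into the (text, question) pairs shown next with a small pandas transformation. This is a sketch of one way to do it, keeping one question per chunk; the column names follow the tables, and the chunk text is a shortened stand-in.

```python
import pandas as pd

# Generated questions, one row per indexed chunk (wide form).
questions_df = pd.DataFrame({
    "question_1": ["What were the two main things the author worked on?"],
    "question_2": ["What language did the author use?"],
    "question_3": ["What was the author's clearest memory?"],
})
chunks_df = pd.DataFrame({"text": ["What I Worked On..."]})

# Melt the wide question columns to long form (ignore_index=False keeps the
# chunk index), then keep the first question per chunk.
long_df = questions_df.melt(ignore_index=False, value_name="question")
pairs_df = chunks_df.join(long_df.groupby(level=0)["question"].first())
print(pairs_df)
```

Each resulting row pairs a chunk with a question whose answer should be retrievable from that chunk, which is exactly what the retrieval evaluation needs.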
| | text | question |
|---|---|---|
| 0 | What I Worked On\n\nFebruary 2021\n\nBefore co… | What were the two main things the author worke… |
| 1 | I was puzzled by the 1401. I couldn’t figure o… | What were the limitations of the 1401 computer… |
| 2 | I remember vividly how impressed and envious I… | What was the author’s first experience with co… |
| 3 | I couldn’t have put this into words when I was… | What were the two things that inspired the aut… |
| 4 | This was more like it; this was what I had exp… | What was the author’s undergraduate thesis about? |
| 5 | Only Harvard accepted me, so that was where I … | What realization did the author have during th… |
| 6 | So I decided to focus on Lisp. In fact, I deci… | What motivated the author to write a book abou… |
| 7 | Anyone who wanted one to play around with coul… | What realization did the author have while vis… |
| 8 | I knew intellectually that people made art — t… | What was the author’s initial perception of pe… |
| 9 | Then one day in April 1990 a crack appeared in… | What was the author’s initial plan for their d… |
Retrieval Evaluation
We are now prepared to perform our retrieval evaluations. We will execute the queries we generated in the previous step and verify whether the correct context is retrieved.

| context.span_id | document_position | context.trace_id | input | reference | document_score |
|---|---|---|---|---|---|
| b375be95-8e5e-4817-a29f-e18f7aaa3e98 | 0 | 20e0f915-e089-4e8e-8314-b68ffdffd7d1 | How does leaving YC affect the author’s relati… | On one of them I realized I was ready to hand … | 0.820411 |
| | 1 | 20e0f915-e089-4e8e-8314-b68ffdffd7d1 | How does leaving YC affect the author’s relati… | That was what it took for Rtm to offer unsolic… | 0.815969 |
| e4e68b51-dbc9-4154-85a4-5cc69382050d | 0 | 4ad14fd2-0950-4b3f-9613-e1be5e51b5a4 | Why did YC become a fund for a couple of years… | For example, one thing Julian had done for us … | 0.860981 |
| | 1 | 4ad14fd2-0950-4b3f-9613-e1be5e51b5a4 | Why did YC become a fund for a couple of years… | They were an impressive group. That first batc… | 0.849695 |
| 27ba6b6f-828b-4732-bfcc-3262775cd71f | 0 | d62fb8e8-4247-40ac-8808-818861bfb059 | Why did the author choose the name ‘Y Combinat… | Screw the VCs who were taking so long to make … | 0.868981 |
| … | … | … | … | … | … |
| 353f152c-44ce-4f3e-a323-0caa90f4c078 | 1 | 6b7bebf6-bed3-45fd-828a-0730d8f358ba | What was the author’s first experience with co… | What I Worked On\n\nFebruary 2021\n\nBefore co… | 0.877719 |
| 16de2060-dd9b-4622-92a1-9be080564a40 | 0 | 6ce5800d-7186-414e-a1cf-1efb8d39c8d4 | What were the limitations of the 1401 computer… | I was puzzled by the 1401. I couldn’t figure o… | 0.847688 |
| | 1 | 6ce5800d-7186-414e-a1cf-1efb8d39c8d4 | What were the limitations of the 1401 computer… | I remember vividly how impressed and envious I… | 0.836979 |
| e996c90f-4ea9-4f7c-b145-cf461de7d09b | 0 | a328a85a-aadd-44f5-b49a-2748d0bd4d2f | What were the two main things the author worke… | What I Worked On\n\nFebruary 2021\n\nBefore co… | 0.843280 |
| | 1 | a328a85a-aadd-44f5-b49a-2748d0bd4d2f | What were the two main things the author worke… | Then one day in April 1990 a crack appeared in… | 0.822055 |
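The per-document relevance judgments produced by the evaluation can be aggregated into a ranking metric such as NDCG@2. This sketch assumes binary relevance labels (1 = relevant, 0 = irrelevant) listed in retrieval order; it is illustrative, not Phoenix's implementation.

```python
import math

def ndcg_at_k(relevance: list[int], k: int = 2) -> float:
    """Normalized discounted cumulative gain over the first k positions."""
    # DCG discounts relevant documents that appear later in the ranking.
    dcg = sum(rel / math.log2(pos + 2) for pos, rel in enumerate(relevance[:k]))
    # Ideal DCG: the same labels in the best possible order.
    ideal = sorted(relevance, reverse=True)
    idcg = sum(rel / math.log2(pos + 2) for pos, rel in enumerate(ideal[:k]))
    return dcg / idcg if idcg > 0 else 0.0

print(ndcg_at_k([1, 0]))  # relevant doc ranked first: perfect score of 1.0
print(ndcg_at_k([0, 1]))  # relevant doc ranked second: penalized
```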
We also request explanations, which prompt the LLM to explain its reasoning. This can be useful for debugging and for figuring out potential corrective actions.
Observations
Let’s now take our results and aggregate them to get a sense of how well our RAG system is performing.
Response Evaluation
The retrieval evaluations demonstrate that our RAG system is not perfect. However, it’s possible that the LLM is able to generate the correct response even when the retrieved context is incorrect. Let’s evaluate the responses generated by the LLM.
Observations
Let’s now take our results and aggregate them to get a sense of how well the LLM is answering the questions given the context.

A QA correctness score of 0.91 and a hallucination score of 0.05 signify that the generated answers are correct ~91% of the time and that the responses contain hallucinations 5% of the time, so there is room for improvement. This could be due to the retrieval strategy or the LLM itself. We will need to investigate further to determine the root cause.
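These aggregate scores are simply the fraction of per-response labels in each category. A minimal sketch, with invented label lists standing in for the evaluator's actual output:

```python
# Per-response labels, as an evaluator might emit them (invented for illustration).
qa_labels = ["correct", "correct", "incorrect", "correct"]
hallucination_labels = ["factual", "factual", "hallucinated", "factual"]

# QA score: fraction of responses judged correct.
qa_score = qa_labels.count("correct") / len(qa_labels)
# Hallucination rate: fraction of responses judged hallucinated.
hallucination_rate = hallucination_labels.count("hallucinated") / len(hallucination_labels)
print(qa_score, hallucination_rate)  # 0.75 0.25
```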
Since we have evaluated our RAG system’s QA performance and Hallucinations performance, let’s send these evaluations to Phoenix for visualization.

