- Build a RAG application using LlamaIndex and OpenAI
- Instrument and trace your application with Phoenix
- Evaluate your application using LLM Evals at both trace and span levels
- Create datasets and run experiments to measure performance changes
- Analyze results and identify areas for improvement
Understanding LLM-Powered Applications
Building software with LLMs is fundamentally different from traditional software development. Rather than compiling source code into binaries that execute deterministic instructions, we work with datasets, embeddings, prompts, and parameter weights to coax out consistent, accurate results. LLM outputs are probabilistic: the same input does not always produce the same output. This probabilistic nature makes observability crucial for understanding and improving LLM applications.

Observing Applications Using Traces
LLM Traces and Observability let us understand the system from the outside, allowing us to ask questions about it without knowing its inner workings. This approach helps us troubleshoot novel problems and answer the question, “Why is this happening?”

What are LLM Traces?
LLM Traces are a category of telemetry data used to understand the execution of LLMs and the surrounding application context, such as:

- Retrieval from vector stores
- Usage of external tools (search engines, APIs)
- Individual steps your application takes
- Overall system performance
Notebook Walkthrough
We will go through key code snippets on this page. To follow the full tutorial, check out the notebook or video above.
LLM Ops Notebook (colab.research.google.com)
Building and Tracing a RAG Application
Let’s build a RAG application that answers questions about Arize AI and trace its execution.

Build the RAG Application
View Traces in Phoenix UI
After running the queries, you can view the traces in the Phoenix UI, which provides an interactive troubleshooting experience. You can sort, filter, and search for traces, and view details of each trace to understand the response generation process.
Evaluating Applications Using LLM Evals
Evaluation should serve as the primary metric for assessing your application. While examining individual queries is beneficial, it becomes impractical as the volume of edge cases and failures increases. Instead, establish a suite of metrics and automated evaluations.

Trace-Level Evaluations
We’ll evaluate the entire request in full context using two key metrics:

- Hallucination Detection: Whether the response contains false information
- Q&A Correctness: Whether the application answers the question correctly

Span-Level Evaluations
Evaluating at the span level isolates the retrieval step, telling you whether the retrieved documents are actually relevant to each query, independent of how well the response was generated.

Creating Experimentation Workflows
Experiments allow you to measure how changes in your application affect evaluation metrics. This requires three main components:

- Dataset: A set of inputs to run the task against.
- Task: A task function that executes the system under test (for example, a function that queries your RAG system).
- Evaluators: One or more evaluators that measure the quality of the task outputs (for example, hallucination and Q&A correctness evaluators).
Define Task and Evaluators
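The task and evaluators can be plain Python functions. In this sketch, `query_engine` stands for the RAG engine under test, and `contains_keyword` is a deliberately simple illustrative evaluator; in practice you would reuse LLM-based evaluators such as hallucination and Q&A correctness:

```python
def task(example) -> str:
    """Run the system under test against one dataset example.

    `example.input` holds the input columns of the dataset row;
    `query_engine` is assumed to be the RAG engine built earlier.
    """
    return str(query_engine.query(example.input["question"]))


def contains_keyword(output: str) -> bool:
    """Toy evaluator: does the answer mention Arize at all?"""
    return "arize" in output.lower()
```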
Run Experiment


