Using Human Annotations for Eval Driven Development
How to leverage human annotations to build evaluations and experiments that improve your system
In this tutorial, we will explore how to build a custom human annotation interface for Phoenix using Lovable. We will then leverage those annotations to construct experiments and evaluate your application. The purpose of a custom annotation UI is to make it easy for anyone to provide structured human feedback on traces, capturing essential details directly in Phoenix. Annotations are vital for collecting feedback during human review, enabling iterative improvement of your LLM applications.
"Applying the scientific method to building AI products" from Eugene Yan in https://eugeneyan.com/writing/eval-process
By establishing this feedback loop and an evaluation pipeline, you can effectively monitor and enhance your system’s performance.
We will generate some LLM traces and send them to Phoenix. We will then annotate these traces to add labels, scores, or explanations directly onto specific spans.
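If your application isn't already instrumented, a minimal setup might look like the following sketch. It assumes Phoenix's OTel helper (`arize-phoenix-otel`) and the OpenInference OpenAI instrumentor are installed, that your Phoenix endpoint and API key are set via environment variables, and that a project named `annotation-tutorial` is acceptable (the name is our assumption, not a requirement):

```python
# Minimal tracing setup (a sketch): assumes PHOENIX_COLLECTOR_ENDPOINT and
# PHOENIX_API_KEY are set in the environment, and the project name is ours.
from openinference.instrumentation.openai import OpenAIInstrumentor
from phoenix.otel import register

# Create a tracer provider pointed at your Phoenix project
tracer_provider = register(project_name="annotation-tutorial")

# Auto-instrument OpenAI calls so each completion becomes a span in Phoenix
OpenAIInstrumentor().instrument(tracer_provider=tracer_provider)
```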
questions = [ "What is the capital of France?", "Who wrote 'Pride and Prejudice'?", "What is the boiling point of water in Celsius?", "What is the largest planet in our solar system?", "Who developed the theory of relativity?", "What is the chemical symbol for gold?", "In which year did the Apollo 11 mission land on the moon?", "What language has the most native speakers worldwide?", "Which continent has the most countries?", "What is the square root of 144?", "What is the largest country in the world by land area?", "Why is the sky blue?", "Who painted the Mona Lisa?", "What is the smallest prime number?", "What gas do plants absorb from the atmosphere?", "Who was the first President of the United States?", "What is the currency of Japan?", "How many continents are there on Earth?", "What is the tallest mountain in the world?", "Who is the author of '1984'?",]
We deliberately use a system prompt that produces some bad or nonsensical outputs, to demonstrate annotating and experimenting with different types of results.
```python
from openai import OpenAI

openai_client = OpenAI()

# System prompt that deliberately produces some nonsensical outputs
system_prompt = """You are a question-answering assistant. For each user question, randomly choose an option: NONSENSE or RHYME.
If you choose RHYME, answer correctly in the form of a rhyme.
If it is NONSENSE, do not answer the question at all, and instead respond with nonsense words and random numbers that do not rhyme, ignoring the user's question completely.
When responding with NONSENSE, include at least five nonsense words and at least five random numbers between 0 and 9999 in your response.
Do not explain your choice."""

# Run through the question list; spans are captured via instrumentation
for question in questions:
    response = openai_client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": question},
        ],
    )
```
Next, you will construct an LLM-as-a-Judge template to evaluate your experiments. This evaluator will mark nonsensical outputs as incorrect. As you experiment, you’ll see evaluation results improve. Once your annotated trace dataset shows consistent improvement, you can confidently apply these changes to your production system.
RHYME_PROMPT_TEMPLATE = """Examine the assistant’s responses in the conversation and determine whether the assistant used rhyme in any of its responses.Rhyme means that the assistant’s response contains clear end rhymes within or across lines. This should be applicable to the entire response.There should be no irrelevant phrases or numbers in the response.Determine whether the rhyme is high quality or forced in addition to checking for the presence of rhyme.This is the criteria for determining a well-written rhyme.If none of the assistant's responses contain rhyme, output that the assistant did not rhyme.[BEGIN DATA] ************ [Question]: {question} ************ [Response]: {answer} [END DATA]Your response must be a single word, either "correct" or "incorrect", and should not contain any text or characters aside from that word."correct" means the response contained a well written rhyme."incorrect" means the response did not contain a rhyme."""
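The experiments below pass an `evaluate_response` evaluator built on this template. It isn't defined in this excerpt, so here is a hedged sketch of what it might look like, mirroring the full-dataset evaluator at the end of this tutorial; the input key and message index are assumptions about how the annotated spans recorded their inputs:

```python
import json

import pandas as pd

from phoenix.evals import OpenAIModel, llm_classify

def evaluate_response(input, output) -> int:
    # Recover the original question from the span's recorded input
    # (assumes the dataset was built from OpenAI chat spans).
    data = json.loads(input["attributes.input.value"])
    question = data["messages"][1]["content"]
    classification = llm_classify(
        dataframe=pd.DataFrame([{"question": question, "answer": output}]),
        template=RHYME_PROMPT_TEMPLATE,
        model=OpenAIModel(model="gpt-4o"),
        rails=["correct", "incorrect"],
        provide_explanation=True,
    )
    # Score 1 when the judge says "correct", else 0
    return int(classification["label"].iloc[0] == "correct")
```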
Experimentation Example: Improving the System Prompt
The next step is to form a hypothesis about why some outputs are failing. In our full walkthrough, we demonstrate the experimentation process by testing out different hypotheses such as swapping out models. However, for demonstration purposes, we will show an experiment that will almost certainly improve your results: modifying the weak system prompt we originally used.
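The experiment code below also assumes a `dataset` built from your annotated traces. One hedged way to construct it, using the project name from the earlier tracing sketch (the dataset name and key choices are illustrative assumptions):

```python
import phoenix as px

client = px.Client()

# Pull spans into a dataframe; filter down (e.g., in pandas) to the
# spans you annotated during human review before uploading.
spans_df = client.get_spans_dataframe(project_name="annotation-tutorial")

# Upload the annotated spans as a Phoenix dataset for experimentation
dataset = client.upload_dataset(
    dataset_name="annotated-traces",  # hypothetical name
    dataframe=spans_df,
    input_keys=["attributes.input.value"],
    output_keys=["attributes.output.value"],
)
```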
```python
import json

from phoenix.experiments.types import Example

system_prompt = """You are a question-answering assistant. For each user question, answer correctly in the form of a rhyme."""

def updated_task(example: Example) -> str:
    # Recover the original question from the span's recorded input
    raw_input_value = example.input["attributes.input.value"]
    data = json.loads(raw_input_value)
    question = data["messages"][1]["content"]
    response = openai_client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content
```
Here, we expect to see improvement. The evaluator should flag significantly fewer nonsensical answers now that the system prompt has been refined.
```python
from phoenix.experiments import run_experiment

experiment = run_experiment(
    dataset,
    task=updated_task,
    evaluators=[evaluate_response],
    experiment_name="updated system prompt",
    experiment_description="updated system prompt",
)
```
Now that we’ve completed a successful experimentation cycle and confirmed our improvements on the annotated traces dataset, we can update the application and test the results on the broader dataset. This helps ensure that improvements made during experimentation translate effectively to real-world usage and that your system performs reliably at scale.
system_prompt = """You are a question-answering assistant. For each user question, answer correctly in the form of a rhyme."""# Run through the dataset and collect spansdef complete_task(question) -> str: question_str = question["Questions"] response = openai_client.chat.completions.create( model="gpt-4o", messages=[ {"role": "system", "content": system_prompt}, {"role": "user", "content": question_str}, ], ) return response.choices[0].message.contentdef evaluate_all_responses(input, output): response_classifications = llm_classify( dataframe=pd.DataFrame([{"question": input["Questions"], "answer": output}]), template=RHYME_PROMPT_TEMPLATE, model=OpenAIModel(model="gpt-4o"), rails=["correct", "incorrect"], provide_explanation=True, ) score = response_classifications.apply(lambda x: 0 if x["label"] == "incorrect" else 1, axis=1) return scoreexperiment = run_experiment( dataset=dataset, #full dataset of questions task=complete_task, evaluators=[evaluate_all_responses], experiment_name="modified-system-prompt-full-dataset",)
Here is a sample prompt you can feed into Lovable (or a similar tool) to start building your custom LLM trace annotation interface; feel free to adjust it to your needs. Note that you will need to implement functionality to fetch spans and send annotations to Phoenix (a sketch of one approach follows the prompt). A tool like this benefits teams that want to collect human annotation data without requiring annotators to work directly in the Phoenix platform. You can also configure features like "thumbs up" and "thumbs down" buttons to streamline filling in annotation fields. Once submitted, the annotations immediately appear in Phoenix.

Prompt for Lovable:

Build a platform for annotating LLM spans and traces:
- Connect to Phoenix Cloud by collecting endpoint, API Key, and project name from the user
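For the "send annotations" piece mentioned above, our approach amounted to a single call against Phoenix's span-annotations REST endpoint. Here is a hedged sketch in Python (the endpoint path and payload shape reflect recent Phoenix versions and may vary; the span ID, annotation name, and values are placeholders):

```python
import os

import httpx

# Placeholders: supply a real span ID from a fetched trace and your own values
payload = {
    "data": [
        {
            "span_id": "abc123",  # hypothetical span ID
            "name": "correctness",
            "annotator_kind": "HUMAN",
            "result": {"label": "correct", "score": 1, "explanation": "Clean rhyme."},
        }
    ]
}

# Assumes PHOENIX_COLLECTOR_ENDPOINT and PHOENIX_API_KEY are set
response = httpx.post(
    f"{os.environ['PHOENIX_COLLECTOR_ENDPOINT']}/v1/span_annotations",
    headers={"Authorization": f"Bearer {os.environ['PHOENIX_API_KEY']}"},
    json=payload,
)
response.raise_for_status()
```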