When To Use the Code Generation Eval Template
This eval checks the correctness and readability of code produced by a code generation process. The template variables are:
- query: the coding question being asked
- code: the code that was returned
Code Generation Eval Template
We are continually iterating on our templates; view the most up-to-date template on GitHub.
How To Run the Code Generation Eval
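Below is a minimal sketch of running this eval with the Phoenix evals library. It assumes the CODE_READABILITY_PROMPT_TEMPLATE referenced in this doc and its rails map are importable from phoenix.evals; exact argument and class names may differ across Phoenix versions.

```python
# A minimal sketch, assuming the phoenix.evals API (names may vary by version).
import pandas as pd
from phoenix.evals import (
    CODE_READABILITY_PROMPT_RAILS_MAP,
    CODE_READABILITY_PROMPT_TEMPLATE,
    OpenAIModel,
    llm_classify,
)

# The dataframe must supply the template variables: "query" and "code".
df = pd.DataFrame(
    {
        "query": ["Write a function that reverses a string."],
        "code": ["def reverse(s):\n    return s[::-1]"],
    }
)

# Rails constrain the model's output to the allowed labels.
rails = list(CODE_READABILITY_PROMPT_RAILS_MAP.values())

results = llm_classify(
    dataframe=df,
    model=OpenAIModel(model="gpt-4", temperature=0.0),
    template=CODE_READABILITY_PROMPT_TEMPLATE,
    rails=rails,
)
print(results["label"])
```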
Benchmark Results
This benchmark was obtained using a Google Colab notebook run against the OpenAI HumanEval dataset as the ground-truth dataset. Each example in the dataset was evaluated using the CODE_READABILITY_PROMPT_TEMPLATE above, and the resulting labels were compared against the ground-truth labels in the benchmark dataset to generate the metrics in the table below.
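For reference, here is a hedged sketch of that comparison step: given ground-truth labels and the labels the eval produced, scikit-learn yields the confusion matrix and the precision/recall/F1 metrics reported below. The label values here are illustrative, not the actual benchmark data.

```python
from sklearn.metrics import classification_report, confusion_matrix

# Illustrative labels; the real benchmark uses the HumanEval-derived dataset.
y_true = ["readable", "unreadable", "readable", "unreadable"]   # ground truth
y_pred = ["readable", "unreadable", "unreadable", "unreadable"]  # eval output

labels = ["readable", "unreadable"]
print(confusion_matrix(y_true, y_pred, labels=labels))
print(classification_report(y_true, y_pred, labels=labels))
```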
Results by Model

| Code Eval | GPT-4 | GPT-3.5 | GPT-3.5-Instruct | Palm 2 (Text Bison) |
|---|---|---|---|---|
| Precision | 0.93 | 0.76 | 0.67 | 0.77 |
| Recall | 0.78 | 0.93 | 1.00 | 0.94 |
| F1 | 0.85 | 0.85 | 0.81 | 0.85 |
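As a quick arithmetic check, F1 is the harmonic mean of precision and recall, so the table's F1 values can be reproduced from its precision and recall rows to within rounding:

```python
# F1 is the harmonic mean of precision and recall.
def f1(precision: float, recall: float) -> float:
    return 2 * precision * recall / (precision + recall)

print(round(f1(0.93, 0.78), 2))  # GPT-4  -> 0.85
print(round(f1(0.77, 0.94), 2))  # Palm 2 -> 0.85
# Other rows may differ by ~0.01, likely because the table's F1 was
# computed from unrounded precision/recall values.
```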

