Concepts: pre-training | fine-tuning | sampling | scaling-laws
Builds on: language-models-are-few-shot-learners
Leads to: self-consistency-chain-of-thought-reasoning | tree-of-thoughts-deliberate-problem-solving
Here’s something that sounds obvious in retrospect: code is the perfect testbed for a language model. Not because programming is more important than language — but because code has a property natural language doesn’t: you can check the answer automatically. Write an essay about climate change and experts will debate whether it’s “correct” for years. Write a function that sorts a list and run the unit tests — pass or fail, no debate, in 30 milliseconds.
The Codex paper (Chen et al., OpenAI, 2021) is built on this insight. It fine-tunes GPT on 159GB of Python scraped from GitHub, introduces a rigorous benchmark (HumanEval) to measure code generation, and — this is the contribution that changed how the field thinks about evaluation — shows that how you sample from the model matters as much as the model itself.
The core idea
The analogy: A studio musician recording a guitar solo never sends the first take. They record 20 takes. 3 of them are brilliant. The rest have missed notes, dropped beats, or just didn’t quite lock in. The skill is absolutely there — it’s just that any single attempt is probabilistic. You extract the best performance by sampling enough.
Codex works the same way. Ask it to write a Python function once — the single sample might be subtly wrong (off-by-one error, wrong edge case). But sample 100 times at a slightly higher temperature and run each against the unit tests. Now you find the versions that work. The model’s distribution contains the correct answer; the question is how many samples you need to extract it.
This reframes the question from “can the model write correct code?” to “how many attempts does it need?” — which is a much more useful question for practitioners.
The mechanism, step by step
Training:
Start with GPT-3’s architecture. Fine-tune it on code from GitHub:
- Scraped 54M public repositories
- Filtered to Python files under 1MB with mean line length under 100 characters
- Result: 159GB of Python after deduplication
- Used a code-aware BPE tokenizer trained on the same corpus (better compression for code than GPT-3’s text tokenizer)
- Fine-tuned models from 85M to 12B parameters: Codex-85M through Codex-12B
The benchmark — HumanEval:
164 hand-written Python programming problems, each consisting of:
- A function signature
- A docstring (the specification — this is the prompt)
- Unit tests (the judge — never shown to the model)
The task: generate the function body from the docstring alone. Evaluation: does the generated code pass all unit tests?
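To make the format concrete, here is an illustrative problem in the same shape (my own example, not one of the 164 actual HumanEval problems): the model sees only the signature and docstring, and the asserts play the role of the hidden unit tests.

```python
# What the model sees: signature + docstring (the prompt).
def running_max(xs):
    """Return a list where element i is the maximum of xs[:i+1]."""
    # What the model must generate: the function body.
    result, best = [], float("-inf")
    for x in xs:
        best = max(best, x)
        result.append(best)
    return result

# What the judge runs: unit tests, never shown to the model.
assert running_max([1, 3, 2, 5, 4]) == [1, 3, 3, 5, 5]
assert running_max([]) == []
```

Many different bodies pass these tests; that is exactly why the evaluation is "run the tests" rather than "compare strings against a reference solution."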
The metric — pass@k:
Here’s where the paper makes its key methodological contribution. The naive metric — “does the first sample pass?” — undersells the model’s capability. The paper introduces pass@k: the probability that at least one of k samples passes all unit tests.

To compute this without bias, they draw n ≥ k samples per problem and use the unbiased estimator:

pass@k = E[ 1 − C(n−c, k) / C(n, k) ]

where n is the total number of samples drawn per problem, c is the number that pass, and C(·,·) is the binomial coefficient. This avoids the overestimate you get from naively plugging an empirical pass rate into 1 − (1 − p̂)^k.
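The estimator is short enough to sketch directly; this minimal version uses math.comb (the paper's reference implementation computes the same quantity in a numerically stable product form):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimator of pass@k: 1 - C(n-c, k) / C(n, k),
    where n = samples drawn per problem, c = samples that pass all tests."""
    if n - c < k:
        return 1.0  # fewer than k failures exist, so any k-subset must contain a pass
    return 1.0 - comb(n - c, k) / comb(n, k)
```

The term C(n−c, k)/C(n, k) is the probability that a random k-subset of the n samples contains only failures; one minus that is the chance at least one passes.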
Walkthrough with actual numbers:
Say for a given problem the model draws n = 100 samples and c = 30 of them pass.

At k = 1 (single sample):

pass@1 = 1 − C(70, 1)/C(100, 1) = 1 − 70/100 = 0.30

Exactly as expected: a 30% chance the first sample is correct.

At k = 10 (pick any 10):

The fraction C(70, 10)/C(100, 10) ≈ 0.023 is the probability that all 10 picked samples are wrong:

pass@10 = 1 − 0.023 ≈ 0.977

So: if 30 out of 100 samples are correct, a single attempt succeeds 30% of the time, but any 10 attempts succeed about 98% of the time.
PROBLEM: "write a function to find the longest palindromic substring"
MODEL SAMPLES (n=100, temperature=0.8):
Sample 1: correct implementation → PASS
Sample 2: off-by-one on right boundary → FAIL
Sample 3: handles single chars correctly → PASS
Sample 4: misses edge case: "" input → FAIL
...
Sample 30: correct → PASS
Samples 31-100: mostly failures with varied bugs
TOTAL: 30 of 100 PASS, 70 FAIL
pass@1  = 30%   (did the first one pass?)
pass@10 ≈ 98%   (did any of 10 random picks pass?)
pass@100 = 100% (any pick of all 100 includes the 30 that pass)
KEY: the SKILL is there. The distribution contains correct answers.
Evaluation = how many samples do you need to surface one?
Temperature calibration:
“We find that the optimal sampling temperature varies for different values of k: for pass@1, we use T = 0.2; for pass@100, we use T = 0.8.”
Translation: for single-sample accuracy, lower temperature (more deterministic = higher confidence each sample is right). For many-sample coverage, higher temperature (more diversity = higher chance at least one is right). The sampling strategy is a first-class design choice, not an afterthought.
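A toy softmax makes the trade-off visible (the logits here are hypothetical, not from Codex):

```python
import numpy as np

def softmax_with_temperature(logits, temperature):
    """Divide logits by T before the softmax: T < 1 sharpens the
    distribution toward the top token, T closer to 1 flattens it."""
    z = np.asarray(logits, dtype=float) / temperature
    z -= z.max()          # subtract max for numerical stability
    p = np.exp(z)
    return p / p.sum()

logits = [2.0, 1.0, 0.5, 0.0]                  # hypothetical next-token logits
print(softmax_with_temperature(logits, 0.2))   # near-greedy: top token ~0.99
print(softmax_with_temperature(logits, 0.8))   # diverse: top token ~0.66
```

At T = 0.2 the model almost always emits its single best guess (good for pass@1); at T = 0.8 the tail tokens get real probability mass, so 100 samples explore many distinct programs (good for pass@100).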
What’s clever — find the instinct:
The pre-Codex assumption was that code generation should be evaluated exactly like classification: does the single best output match the reference? This mirrors how we test humans on exams: one chance, binary grade.
But this is wrong for LLMs in two ways. First, there are many correct implementations of any function — string comparison against a reference fails almost all of them. Second, and more importantly, the model’s distribution is richer than any single sample. Evaluating a single draw conflates “the model can’t do this” with “the model can do this but you were unlucky.”
“We find that repeated sampling from the model is a surprisingly effective strategy for producing working solutions to difficult prompts. Using this method, we solve 70.2% of our problems with 100 samples per problem.”
The pass@k framework separates “capability” (what’s in the distribution) from “reliability” (how often you draw a good sample). These are different axes and practitioners care about both differently depending on their use case.
Also clever: filtering training data by file quality rather than by domain. They don’t select “good code” — they select “code that looks structured” (line length, file size). The model learns what good code looks like by osmosis from the vast majority of well-written Python on GitHub.
Why GPT-3 scores 0%:
“GPT-3 solves 0% of the problems with a single sample.”
GPT-3’s tokenizer wasn’t trained on code; its pretraining corpus was mostly text. A 175B-parameter model, trained on the entire internet, cannot write a correct Python function from a docstring — because Python syntax, import conventions, indentation rules, and library idioms are a tiny fraction of what it learned. Fine-tuning on 159GB of Python changes everything: the model now has a prior over code structure that allows it to produce syntactically valid, semantically meaningful programs.
Results
| Model | pass@1 | pass@10 | pass@100 |
|---|---|---|---|
| GPT-3 (175B) | 0.0% | 0.0% | 0.0% |
| GPT-J-6B | 11.4% | — | 27.7% |
| Codex-85M | 8.2% | 12.8% | 22.0% |
| Codex-300M | 13.2% | 20.9% | 36.3% |
| Codex-2.5B | 21.4% | 35.1% | 59.5% |
| Codex-12B | 28.8% | 46.8% | 70.2% |
| Codex-12B + mean logprob reranking | 44.3% | — | — |
Key observations:
- GPT-3 can’t do it at all — fine-tuning on code is essential, not optional.
- Scaling within code-trained models follows smooth power laws: each doubling of parameters gives a meaningful pass@1 improvement.
- The gap between pass@1 (28.8%) and pass@100 (70.2%) is enormous — the model’s latent capability far exceeds what single-sample evaluation reveals.
- Mean logprob reranking (using the model’s own log-probability to pick the best sample) boosts pass@1 from 28.8% to 44.3% — essentially free, and it uses the model’s internal confidence.
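The reranking step itself is simple to sketch (the candidate format below is hypothetical; the real system scores actual model samples with their token log-probabilities):

```python
def rerank_by_mean_logprob(candidates):
    """candidates: list of (code, token_logprobs) pairs.
    Return the code whose mean per-token log-probability is highest,
    i.e. the sample the model itself is most confident in."""
    return max(candidates, key=lambda c: sum(c[1]) / len(c[1]))[0]

# Toy example: the second candidate has a higher mean logprob (-0.3 vs ~-0.97).
picked = rerank_by_mean_logprob([
    ("return sorted(xs)[0]", [-0.9, -1.2, -0.8]),
    ("return min(xs)",       [-0.3, -0.2, -0.4]),
])
# picked == "return min(xs)"
```

The appeal is that no test execution is needed: the selection signal comes entirely from the model's own distribution, which is why the paper calls it essentially free.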
What doesn’t work:
“We find that Codex struggles with long chains of operations and with binding variables.”
Long docstrings that describe 5+ sequential operations exceed the model’s effective planning horizon. It handles the first few steps, then drifts. Also: code that depends on previously defined variables (imported at the top of the file, or defined in earlier functions) is hard — the model sometimes ignores them and generates its own.
Security is a real problem the paper addresses honestly: the model will reproduce vulnerable code patterns from its training data. It generates SQL injections, shell injections, and hardcoded credentials — not because it “wants to,” but because that’s what’s in the training distribution.
Practical implications
If you’re building code generation systems, the pass@k framework changes your architecture decisions. For a coding assistant where the user picks from multiple suggestions (like Copilot), optimize for pass@10 or pass@5 — use higher temperature and generate 5-10 candidates. For autonomous code agents that run tests in a loop, optimize for pass@100 with high temperature and fast test execution. These are radically different optimization targets, and conflating them leads to systems that look good on benchmarks but perform poorly in deployment.
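The agent-loop variant can be sketched with stubs standing in for the model and the test runner (both hypothetical; a real system would call a sampling API and execute real unit tests in a sandbox):

```python
import random

def solve_by_sampling(generate, passes_tests, max_samples=100):
    """Draw up to max_samples candidates; return the first that passes
    along with how many attempts it took, or (None, max_samples)."""
    for attempt in range(1, max_samples + 1):
        candidate = generate()
        if passes_tests(candidate):
            return candidate, attempt
    return None, max_samples

# Stubs: a "model" that is right 30% of the time, and a binary judge.
random.seed(0)
generate = lambda: "correct" if random.random() < 0.30 else "buggy"
passes_tests = lambda c: c == "correct"

solution, attempts = solve_by_sampling(generate, passes_tests)
```

With a 30% per-sample success rate, the chance that all 100 draws fail is 0.7^100 ≈ 3e-16 — which is the pass@100 logic of the paper restated as an engineering loop.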
The more general lesson: when you have a verifiable reward signal (code tests, math verification, game outcomes), pass@k evaluation with sampling is strictly better than single-sample evaluation. It tells you the right thing: not “did the model nail it this time” but “does the model know how to do this at all.” The distinction matters enormously for setting expectations and for choosing between methods.
This paper established HumanEval as the standard coding benchmark — every coding LLM since 2021 reports pass@1 on HumanEval. It also established the GPT→code fine-tune recipe that produced GitHub Copilot, and the pass@k idea that later generalized into the self-consistency and tree-of-thoughts reasoning methods: sample multiple completions, pick the best one.
The model contains the answer. The question is how many attempts you need to find it.
Connections
- pre-training — Codex uses GPT-3’s pretrained weights as starting point, fine-tuned on code
- fine-tuning — fine-tuning GPT on 159GB of Python is the core training step
- sampling — pass@k and temperature calibration make sampling a first-class modeling decision
- scaling-laws — pass@k improves smoothly with model scale from 85M to 12B parameters
- language-models-are-few-shot-learners — GPT-3 is the base model Codex fine-tunes from
- self-consistency-chain-of-thought-reasoning — generalizes the “sample many, pick best” insight to reasoning
- tree-of-thoughts-deliberate-problem-solving — extends sampling from flat to tree-structured search
Citation
Chen, M., Tworek, J., Jun, H., Yuan, Q., Pinto, H. P. d. O., Kaplan, J., Edwards, H., Burda, Y., Joseph, N., Brockman, G., Ray, A., Puri, R., Krueger, G., Petrov, M., Khlaaf, H., Sastry, G., Mishkin, P., Chan, B., Gray, S., Ryder, N., Pavlov, M., Power, A., Kaiser, L., Bavarian, M., Winter, C., Tillet, P., Such, F. P., Cummings, D., Plappert, M., Chantzis, F., Barnes, E., Herbert-Voss, A., Guss, W. H., Nichol, A., Paino, A., Tezak, N., Tang, J., Babuschkin, I., Balaji, S., Jain, S., Saunders, W., Hesse, C., Carr, A. N., Leike, J., Achiam, J., Misra, V., Morikawa, E., Radford, A., Knight, M., Brundage, M., Murati, M., Mayer, K., Welinder, P., McGrew, B., Amodei, D., McCandlish, S., Sutskever, I., & Zaremba, W. (2021). Evaluating Large Language Models Trained on Code. arXiv preprint. https://arxiv.org/abs/2107.03374