What It Is

Training a language model to write functional programs from natural language specifications — typically a docstring or problem description — and evaluating correctness by running unit tests.

Why It Matters

Code is uniquely verifiable: unlike natural language tasks, correctness is binary and automatic (run the tests, get pass/fail). This makes code generation an unusually rigorous testbed for LLM capability, and enables scalable evaluation via pass@k metrics rather than human judgment.
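The "run the tests, get pass/fail" loop can be sketched in a few lines of Python. This is an illustrative toy (the function names and the use of `exec` are assumptions, not any particular benchmark's harness; real harnesses sandbox execution and add timeouts):

```python
# Toy sketch of binary correctness checking: execute a candidate
# solution, then run bare-assert unit tests against it.
# `check_candidate` is a hypothetical helper, not a real library API.

def check_candidate(candidate_src: str, test_src: str) -> bool:
    """Return True iff every assertion in `test_src` passes for `candidate_src`."""
    namespace: dict = {}
    try:
        exec(candidate_src, namespace)  # define the candidate function
        exec(test_src, namespace)       # assertions raise AssertionError on failure
        return True
    except Exception:
        return False

candidate = "def add(a, b):\n    return a + b"
tests = "assert add(2, 3) == 5\nassert add(-1, 1) == 0"
print(check_candidate(candidate, tests))  # True: all asserts pass
```

The outcome is strictly boolean, which is what makes the metric automatic and scalable.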

How It Works

A pretrained language model is fine-tuned on a large corpus of code (e.g., GitHub repositories). At inference time, the model receives a function signature and docstring as a prompt and generates the function body. Multiple samples are drawn at varying temperatures; correctness is determined by executing the generated code against a hidden test suite. The key metric is pass@k: the probability that at least one of k sampled completions passes all tests.
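In practice, pass@k is computed with the standard unbiased combinatorial estimator: draw n samples per problem, count the c that pass, and estimate the chance that a random subset of k contains at least one passing sample. A minimal sketch:

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem:
    n = total samples drawn, c = samples that passed all tests,
    k = budget. Probability that at least one of k samples
    (chosen without replacement from the n) is correct:
    1 - C(n - c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # fewer than k failures exist, so some subset must pass
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

# If 3 of 10 samples pass, pass@1 is simply 3/10.
print(pass_at_k(10, 3, 1))  # 0.3 (up to floating point)
```

Averaging this quantity over all problems in the benchmark gives the reported pass@k score.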

Key Sources