Highly Accurate Protein Structure Prediction with AlphaFold

Concepts: attention | transformer | evoformer | protein-structure | self-supervised-learning
Builds on: attention-is-all-you-need
Leads to: (none)

There are about 200 million known protein sequences and roughly 100,000 experimentally solved structures. Every sequence encodes a shape. Every shape encodes a function. But going from sequence to shape has been one of biology’s hardest open problems for over 50 years, and each experimental structure can take months to years of X-ray crystallography or cryo-electron microscopy. AlphaFold closes that gap. It predicts 3D protein structure from amino acid sequence alone, regularly at atomic accuracy, in GPU minutes to hours.

The problem

Proteins are chains of amino acids that fold into precise 3D shapes, and the shape is everything. A misfolded protein can cause Parkinson’s, Alzheimer’s, cystic fibrosis. A correctly folded one can be the target of a drug that cures an infection. For 50 years, the field knew the sequence determined the shape — Anfinsen’s thermodynamic hypothesis, Nobel Prize 1972 — but couldn’t compute it.

“Predicting the three-dimensional structure that a protein will adopt based solely on its amino acid sequence—the structure prediction component of the ‘protein folding problem’—has been an important open research problem for more than 50 years.”

The challenge: if each residue can take just 3 backbone states, a 100-residue protein has roughly $3^{100} \approx 5 \times 10^{47}$ possible conformations. This is Levinthal’s paradox: even at 10 picoseconds per conformation, sampling all of them would take vastly longer than the age of the universe. Yet real proteins fold in milliseconds. There’s a shortcut hidden in evolution. Finding it computationally is what AlphaFold does.
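The arithmetic behind that estimate is easy to check. A minimal sketch, using the 3 states per residue and 10 picoseconds per conformation from the text; the age of the universe (taken here as roughly 13.8 billion years) is an assumed value that the text does not state:

```python
# Levinthal-style back-of-the-envelope estimate.
STATES_PER_RESIDUE = 3
N_RESIDUES = 100
SECONDS_PER_CONFORMATION = 10e-12    # 10 picoseconds
AGE_OF_UNIVERSE_YEARS = 13.8e9       # assumed value, not from the text
SECONDS_PER_YEAR = 365.25 * 24 * 3600

n_conformations = STATES_PER_RESIDUE ** N_RESIDUES
search_seconds = n_conformations * SECONDS_PER_CONFORMATION
search_years = search_seconds / SECONDS_PER_YEAR
ratio = search_years / AGE_OF_UNIVERSE_YEARS

print(f"conformations:     {n_conformations:.2e}")
print(f"exhaustive search: {search_years:.2e} years")
print(f"ratio to age of universe: {ratio:.1e}")
```

Under these assumptions the exhaustive search exceeds the age of the universe by many orders of magnitude, which is the whole point of the paradox.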

The core idea

The analogy: Imagine you’re trying to reconstruct the seating arrangement at a dinner party, but you weren’t there. You only have one clue: who tends to sit near whom across hundreds of past parties. If Alice always ends up near Bob and away from Carol, that’s a spatial constraint. Enough of these pairwise constraints, properly reasoned about, triangulate the full arrangement.

That’s what AlphaFold does, but for biology. Proteins in the same family evolve under shared structural constraints across species. If residue 42 mutates (say, from alanine to valine), and residue 87 almost always mutates as well to compensate, that’s evidence the two are physically touching in the folded structure: a change on one side of a contact must be matched by a change on the other side to keep the protein stable. These co-mutations, mined from a multiple sequence alignment (MSA) built by searching databases of millions of sequences, are spatial constraints hiding in plain sight.
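To make the co-mutation idea concrete, here is a minimal sketch that scores pairs of alignment columns by mutual information, one classical covariation measure. This is an illustration only: AlphaFold does not compute this statistic explicitly (it learns from the MSA directly), and the toy alignment and the function name `mutual_information` are invented for the example.

```python
import math
from collections import Counter

def mutual_information(msa, i, j):
    """Mutual information (in bits) between alignment columns i and j."""
    n = len(msa)
    col_i = Counter(seq[i] for seq in msa)
    col_j = Counter(seq[j] for seq in msa)
    pairs = Counter((seq[i], seq[j]) for seq in msa)
    mi = 0.0
    for (a, b), count in pairs.items():
        p_ab = count / n
        mi += p_ab * math.log2(p_ab / ((col_i[a] / n) * (col_j[b] / n)))
    return mi

# Hypothetical toy alignment: rows are homologues, columns are positions.
# Columns 1 and 4 co-vary (A with L, V with I); column 2 varies independently.
toy_msa = [
    "MAGKLT",
    "MVGKIT",
    "MASKLT",
    "MVSKIT",
    "MAGKLT",
    "MVSKIT",
    "MASKLT",
    "MVGKIT",
]

mi_coupled = mutual_information(toy_msa, 1, 4)
mi_independent = mutual_information(toy_msa, 1, 2)
print(f"columns (1, 4): {mi_coupled:.2f} bits")
print(f"columns (1, 2): {mi_independent:.2f} bits")
```

The coupled pair carries a full bit of shared information while the independent pair carries none; a high score like this is the kind of evidence that suggests two positions are in contact.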

The key challenge: extract those constraints, enforce geometric consistency, and predict 3D coordinates. AlphaFold solves this with two stacked components — Evoformer and the Structure Module.

The mechanism, step by step

Input. For a target sequence of N residues, AlphaFold builds:

An MSA representation of shape (N_seq × N): rows are homologous sequences from across evolution, columns are residue positions. This encodes which residues co-evolve.
A pair representation of shape (N × N): for every pair of residues (i, j), a learned vector encoding their predicted spatial relationship.

Evoformer. 48 blocks, each updating both representations simultaneously. The key operations:

MSA row attention (within-sequence): for each sequence in the MSA, each residue attends to every other residue in that same sequence. The attention bias comes from the pair representation — so spatial constraints directly modulate sequence-level attention.
MSA column attention (across-sequence): for each residue position, attention runs across all sequences in the MSA. This is how the model aggregates evolutionary signal — “what do all homologues tell us about position 42?”
Outer product mean: the MSA representation updates the pair representation. For each pair (i, j), take the outer product of their MSA column representations and average over sequences. This continuously pushes evolutionary signal into the pairwise distance predictions.
Triangle updates: the pair representation gets updated with triangle-constrained operations. If (i, j) and (j, k) are known, then (i, k) is constrained by the triangle inequality. AlphaFold builds this in as an inductive bias, not a hard constraint, using triangle multiplicative updates and triangle self-attention.
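The outer product mean described above can be sketched in a few lines. This is a simplified illustration in plain Python: the real layer also applies layer normalization and learned linear projections before and after the outer product, which are omitted here, and the tiny input values are made up.

```python
def outer_product_mean(msa_repr):
    """msa_repr[s][i] is a length-c feature vector for sequence s, residue i.

    Returns pair[i][j], a flattened vector of length c*c: the outer product
    of the column-i and column-j features, averaged over all sequences.
    """
    n_seq = len(msa_repr)
    n_res = len(msa_repr[0])
    c = len(msa_repr[0][0])
    pair = [[[0.0] * (c * c) for _ in range(n_res)] for _ in range(n_res)]
    for s in range(n_seq):
        for i in range(n_res):
            for j in range(n_res):
                a, b = msa_repr[s][i], msa_repr[s][j]
                for p in range(c):
                    for q in range(c):
                        pair[i][j][p * c + q] += a[p] * b[q] / n_seq
    return pair

# Toy MSA representation: 2 sequences x 3 residues x 2 channels (made-up values).
msa = [
    [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]],
]
pair = outer_product_mean(msa)
print(pair[0][1])  # features for residue pair (0, 1)
```

The output has one feature vector per residue pair, so an (N_seq × N) input becomes an (N × N) update, which is how per-sequence information reaches the pair representation.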

“For a pairwise description of amino acids to be representable as a single 3D structure, many constraints must be satisfied including the triangle inequality on distances. On the basis of this intuition, we arrange the update operations on the pair representation in terms of triangles of edges involving three different nodes.”
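The core of the triangle multiplicative update (the "outgoing edges" variant) is a sum over the third node of each triangle. The sketch below is a heavily simplified, single-channel version: it assumes the two inputs `a` and `b` are both the raw pair scores, whereas the real layer uses gated, learned projections over many channels plus layer normalization.

```python
def triangle_multiply_outgoing(a, b):
    """update[i][j] = sum over k of a[i][k] * b[j][k].

    Edge (i, j) is updated from the other two edges, (i, k) and (j, k),
    of every triangle that contains it.
    """
    n = len(a)
    return [[sum(a[i][k] * b[j][k] for k in range(n)) for j in range(n)]
            for i in range(n)]

# Toy single-channel pair scores for 4 residues (made-up values):
# residues 0 and 1 are each strongly linked to residue 2, nothing else is known.
z = [
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [1.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]
update = triangle_multiply_outgoing(z, z)
print(update[0][1])  # evidence for (0, 1) via shared neighbour 2
print(update[0][3])  # no triangle supports (0, 3)
```

Edge (0, 1) receives signal because both of its endpoints are linked to residue 2, while edge (0, 3) receives none. That is the triangle intuition from the quote expressed as a single contraction.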

INPUT SEQUENCE: A-G-C-L-...-K  (N residues)
        |
        v
  Search sequence databases
        |
        v
+------------------+   +------------------+
| MSA              |   | Pair repr.       |
| (N_seq x N)      |   | (N x N)          |
| rows: homologues |   | spatial rel.     |
| cols: positions  |   | between residues |
+--------+---------+   +--------+---------+
         |                      |
         v                      v
  +------+----------------------+------+
  |          EVOFORMER (x48 blocks)    |
  |                                    |
  |  MSA row attention  (within seq)   |
  |  MSA col attention  (across seqs)  |
  |  Outer product → pair update       |
  |  Triangle attn  → pair update      |
  |  Pair bias    → MSA update         |
  +------+----------------------+------+
         |                      |
         v                      v
   MSA repr.             Pair repr.
  (refined)              (refined)
         |
         v (first row only = single repr.)
  +------+------------------------------+
  |       STRUCTURE MODULE              |
  |  Invariant Point Attention (IPA)    |
  |  Per-residue rigid body frames      |
  |  χ-angle prediction (side chains)   |
  +------+------------------------------+
         |
         v
  3D coordinates (all atoms)
         |
         v RECYCLING (x3 passes)

Structure Module. Takes the Evoformer’s final single representation (first row of MSA) plus the pair representation and converts them to 3D coordinates. The key component is Invariant Point Attention (IPA), an attention mechanism that operates on per-residue rigid body frames (backbone N-Cα-C triangles) in 3D space. Because IPA is invariant to global rotations and translations (its outputs don’t depend on the global position or orientation of the protein), the model reasons about local geometry in each residue’s own frame rather than in absolute coordinates.

Side chains are predicted via torsion angles (up to four χ angles, χ₁–χ₄, depending on the residue type), which then place the atoms geometrically.

Loss: Frame Aligned Point Error (FAPE). The training signal measures the per-atom position error as seen from every residue’s local frame:

$L_{\mathrm{FAPE}} = \frac{1}{N_{\mathrm{frames}} \cdot N_{\mathrm{atoms}}} \sum_{k, i} d\left(\hat{T}_{k}^{-1} \circ \hat{x}_{i},\; T_{k}^{-1} \circ x_{i}\right)$

where $\hat{T}_{k}$ and $T_{k}$ are the predicted and true rigid body frames (rotation + translation) of residue $k$, $\hat{x}_{i}$ is the predicted atom position, $x_{i}$ is the true position, and $d$ is the Euclidean distance (clamped at 10 Å in the paper). Each structure is expressed in its own local frames before the comparison, so FAPE penalizes local structural errors without being confused by global rotations or translations. Translation: even if the whole protein is rotated, the loss still detects when individual residues are in the wrong position relative to their neighbors.
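The invariance property can be checked numerically. The following is a minimal sketch, not the paper's implementation: frames are given directly as (rotation matrix, translation) pairs, predicted atoms are expressed in predicted frames and true atoms in true frames, the distance is clamped at 10 Å, and the small epsilon and length scaling used in the paper are left out. All coordinates and helper names are made up for illustration.

```python
import math

def mat_vec(R, v):
    """Multiply a 3x3 matrix by a 3-vector."""
    return [sum(R[r][c] * v[c] for c in range(3)) for r in range(3)]

def mat_mul(A, B):
    """Multiply two 3x3 matrices."""
    return [[sum(A[r][k] * B[k][c] for k in range(3)) for c in range(3)]
            for r in range(3)]

def to_local(frame, x):
    """Apply T^{-1} to x, where frame = (R, t) maps local to global coordinates."""
    R, t = frame
    shifted = [x[d] - t[d] for d in range(3)]
    # R is orthonormal, so its inverse is its transpose.
    return [sum(R[r][c] * shifted[r] for r in range(3)) for c in range(3)]

def fape(pred_frames, pred_atoms, true_frames, true_atoms, clamp=10.0):
    """Mean clamped distance between atoms expressed in matching local frames."""
    total = 0.0
    for pf, tf in zip(pred_frames, true_frames):
        for px, tx in zip(pred_atoms, true_atoms):
            d = math.dist(to_local(pf, px), to_local(tf, tx))
            total += min(d, clamp)
    return total / (len(pred_frames) * len(pred_atoms))

def rot_z(theta):
    """Rotation by theta radians about the z axis."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

def rigid_move(frames, atoms, R, t):
    """Apply one global rotation R and translation t to every frame and atom."""
    def move_point(p):
        return [a + b for a, b in zip(mat_vec(R, p), t)]
    return ([(mat_mul(R, Rk), move_point(tk)) for Rk, tk in frames],
            [move_point(x) for x in atoms])

# Toy "true" structure: two residue frames and three atoms (made-up values).
true_frames = [(rot_z(0.0), [0.0, 0.0, 0.0]), (rot_z(0.5), [3.8, 0.0, 0.0])]
true_atoms = [[0.0, 0.0, 0.0], [3.8, 0.0, 0.0], [5.0, 2.0, 0.0]]

# Prediction 1: the true structure, globally rotated and shifted.
moved_frames, moved_atoms = rigid_move(
    true_frames, true_atoms, rot_z(1.2), [10.0, -4.0, 7.0])
loss_rigid = fape(moved_frames, moved_atoms, true_frames, true_atoms)

# Prediction 2: same frames as the truth, but one atom displaced by 1 Å.
bad_atoms = [list(a) for a in true_atoms]
bad_atoms[2][1] += 1.0
loss_local = fape(true_frames, bad_atoms, true_frames, true_atoms)

print(f"rigid motion only:   {loss_rigid:.6f}")
print(f"one atom off by 1 Å: {loss_local:.6f}")
```

A pure rigid motion gives a loss of essentially zero, while displacing one of three atoms by 1 Å gives about 0.33 Å: the error is seen once from each frame and averaged over frames times atoms.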

Recycling. The output 3D structure is fed back in as input for 3 iterations:

“We reinforce the notion of iterative refinement by repeatedly applying the final loss to outputs and then feeding the outputs recursively into the same modules.”

This is the same insight as iterative refinement in image segmentation: a coarse first guess gets refined with its own structure as additional context.
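The control flow of recycling is simple enough to sketch. In this toy version, `toy_model` is a hypothetical stand-in for the whole Evoformer plus Structure Module stack, and the loop assumes one initial pass followed by 3 recycled passes; only the pattern of reusing the same function on its own output is meant to carry over.

```python
def toy_model(features, prev_coords):
    """Hypothetical stand-in for the network: moves each previous
    coordinate halfway toward a value implied by the input features."""
    return [p + 0.5 * (f - p) for f, p in zip(features, prev_coords)]

def predict_with_recycling(features, n_recycle=3):
    """Run the same model repeatedly, feeding each output back in."""
    coords = [0.0] * len(features)   # the first pass has no previous output
    history = []
    for _ in range(n_recycle + 1):   # one initial pass + n_recycle recycles
        coords = toy_model(features, coords)
        history.append(coords)
    return history

history = predict_with_recycling([8.0])
print([step[0] for step in history])
```

Each pass starts from the previous estimate, so the prediction moves steadily closer to the feature-implied value without adding any new parameters.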

Walkthrough with illustrative numbers. Consider two residues $i = 42$ and $j = 87$ in a hypothetical 100-residue protein (the values are chosen to show the flow, not taken from the paper). After scanning 5,000 homologous sequences:

Co-mutation frequency (42, 87): 0.73  ← high → likely in contact
Co-mutation frequency (42, 50): 0.12  ← low  → likely not in contact

Initial pair repr. after Evoformer block 1:
  (42, 87): distance logit peak at bin 3.5–4.5 Å → ~4 Å predicted
  (42, 50): distance logit peak at bin 12–14 Å → ~13 Å predicted

Triangle update: if (42, 87) ≈ 4 Å and (87, 93) ≈ 6 Å,
  then (42, 93) must be ≤ 10 Å (triangle inequality)
  → pair repr. (42, 93) shifts toward shorter-distance bins

After 48 Evoformer blocks: 4,950 triangle-consistent pairwise predictions (100 × 99 / 2 unique pairs)
  → Structure module places Cα(42) and Cα(87) at 3.8 Å apart in 3D
  → Matches crystal structure at 0.9 Å RMSD for this region

What’s clever — the instinct. Why hadn’t co-mutation analysis cracked protein folding before? It had been tried since the 1990s. The key failure was transitivity: you can’t independently predict each pair and then assemble them. If (i, j) ≈ 4 Å and (j, k) ≈ 4 Å, then (i, k) is somewhere between 0 and 8 Å — that’s too much uncertainty. Every pair’s prediction constrained every other pair, and solving the resulting constraint satisfaction problem required hand-crafted energy functions that never generalized.
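The transitivity problem is easy to demonstrate. This is a small sketch with made-up distances; the helper names `third_side_bounds` and `triangle_violations` are invented for the example.

```python
def third_side_bounds(d_ij, d_jk):
    """Range of distances (i, k) allowed by the triangle inequality."""
    return abs(d_ij - d_jk), d_ij + d_jk

def triangle_violations(dist, tol=1e-9):
    """Triples (i, j, k) where dist[i][k] exceeds dist[i][j] + dist[j][k]."""
    n = len(dist)
    return [(i, j, k)
            for i in range(n) for j in range(n) for k in range(n)
            if len({i, j, k}) == 3
            and dist[i][k] > dist[i][j] + dist[j][k] + tol]

print(third_side_bounds(4.0, 4.0))   # the 0 to 8 Å range from the text
print(third_side_bounds(4.0, 6.0))

# Three pairwise distances predicted independently (made-up values, in Å).
inconsistent = [
    [0.0, 4.0, 12.0],
    [4.0, 0.0, 4.0],
    [12.0, 4.0, 0.0],
]
consistent = [
    [0.0, 4.0, 7.0],
    [4.0, 0.0, 4.0],
    [7.0, 4.0, 0.0],
]
print(triangle_violations(inconsistent))
print(triangle_violations(consistent))
```

The first matrix cannot correspond to any set of 3D points, because two 4 Å legs cannot span 12 Å. Independent per-pair predictors have no way to rule this out, which is why consistency has to be handled jointly.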

AlphaFold’s insight is to build consistency into the network, not to bolt it on as a post-processing step. The triangle operations push the pair representation toward global self-consistency before it ever reaches the structure module. The network learns which triangles matter for which fold family. No hand-crafted energy function, no explicit constraint solver: just a sufficiently expressive attention architecture with geometric inductive biases, trained on roughly 100K PDB structures.

“This is a combination of the bioinformatics and physical approaches: we use a physical and geometric inductive bias to build components that learn from PDB data with minimal imposition of handcrafted features.”

Does it work? What breaks?

Benchmark	AlphaFold	Next-best method	What this means
CASP14 median backbone Cα RMSD (95% residue coverage)	0.96 Å	2.8 Å	Error below the width of a carbon atom (~1.4 Å)
CASP14 median all-atom RMSD (95% residue coverage)	1.5 Å	3.5 Å	Side chains placed accurately too
2,180-residue protein (no structural homologues)	Correct domain packing	Incorrect / incomplete	Scales to very long proteins

“As a comparison point for this accuracy, the width of a carbon atom is approximately 1.4 Å.”

In CASP14, AlphaFold’s median backbone error was nearly 3x lower than that of the next-best method (0.96 Å versus 2.8 Å), a margin so large that the organizers described it as solving the problem.

What doesn’t work:

Intrinsically disordered regions (IDRs): proteins that don’t have a stable single fold are poorly predicted. AlphaFold’s pLDDT (confidence score) typically flags these as low-confidence, and the coordinates in those regions should not be read as a real structure.
Novel folds with no MSA depth: if a protein has few homologues in sequence databases, the co-mutation signal is too sparse. The paper reports that accuracy drops substantially below a median alignment depth of around 30 sequences, though the model can still produce reasonable predictions when structural templates are available.
Protein-protein interactions: AlphaFold2 was not designed for complexes. AlphaFold-Multimer extends this, but with less accuracy than single-chain predictions.
Ligand binding: the model predicts structures “as they appear in PDB” — it can implicitly handle common cofactors seen in training, but doesn’t explicitly model small molecule binding or conformational changes on binding.
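In practice, the first check on any prediction is its per-residue pLDDT. AlphaFold's PDB-format outputs store pLDDT in the B-factor field (columns 61 to 66 of each ATOM record). The sketch below reads that field for Cα atoms and flags residues below 50, the range usually treated as very low confidence. The demo records are generated by a hypothetical helper, `pdb_ca_line`, with zeroed coordinates and made-up scores.

```python
def pdb_ca_line(serial, res_name, res_seq, plddt):
    """Format a minimal CA ATOM record for the demo (coordinates zeroed)."""
    return (f"ATOM  {serial:5d}  CA  {res_name:>3} A{res_seq:4d}    "
            f"{0.0:8.3f}{0.0:8.3f}{0.0:8.3f}{1.0:6.2f}{plddt:6.2f}")

def plddt_per_residue(pdb_text):
    """Map residue number -> pLDDT, read from the B-factor field of CA atoms."""
    scores = {}
    for line in pdb_text.splitlines():
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            scores[int(line[22:26])] = float(line[60:66])
    return scores

def low_confidence(scores, threshold=50.0):
    """Residue numbers whose pLDDT falls below the threshold."""
    return [res for res, s in sorted(scores.items()) if s < threshold]

# Made-up demo: a floppy N-terminus followed by a well-predicted core.
demo_residues = [("MET", 35.2), ("SER", 42.0), ("LEU", 71.5),
                 ("VAL", 93.1), ("ILE", 96.4)]
demo_pdb = "\n".join(pdb_ca_line(i + 1, name, i + 1, score)
                     for i, (name, score) in enumerate(demo_residues))

scores = plddt_per_residue(demo_pdb)
flagged = low_confidence(scores)
print(scores)
print("treat with caution:", flagged)
```

Residues flagged this way are candidates for disorder or missing evolutionary signal, and their coordinates should not be used for downstream modeling without further evidence.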

So what?

“the model is able to provide precise, per-residue estimates of its reliability that should enable the confident use of these predictions”

If you work on drug discovery, protein engineering, or molecular biology: the AlphaFold Protein Structure Database (AFDB) now holds over 200 million predicted structures, covering most of UniProt. The structure you need probably already exists at https://alphafold.ebi.ac.uk. For novel proteins, run ColabFold (a free, open-source pipeline built on the AlphaFold2 models); for a typical single-chain protein you will often have a structure within minutes.

The design choices map directly onto attention. Evoformer uses the same scaled dot-product attention from attention-is-all-you-need, but applies it in three distinct geometries: across residues in a sequence (row), across sequences at a position (column), and across triangles of pairwise relationships. The triangle constraint is the key move that standard transformer attention doesn’t have — it bakes in the fact that distances must be geometrically consistent.
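The row and column geometries differ only in which axis of the MSA grid the attention runs along. This is a minimal sketch under strong simplifications: queries, keys, and values are all the raw input vectors (no learned projections, no heads, no gating, no pair bias), and triangle attention is not shown. The input values are made up.

```python
import math

def self_attention(vectors):
    """Scaled dot-product self-attention with Q = K = V = vectors."""
    d = len(vectors[0])
    out = []
    for q in vectors:
        logits = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d)
                  for k in vectors]
        m = max(logits)                          # for numerical stability
        weights = [math.exp(l - m) for l in logits]
        z = sum(weights)
        out.append([sum(w / z * v[c] for w, v in zip(weights, vectors))
                    for c in range(d)])
    return out

def row_attention(msa):
    """Each sequence (row) attends over its own residue positions."""
    return [self_attention(row) for row in msa]

def column_attention(msa):
    """Each residue position (column) attends over all sequences."""
    n_seq, n_res = len(msa), len(msa[0])
    cols = [self_attention([msa[s][i] for s in range(n_seq)])
            for i in range(n_res)]
    return [[cols[i][s] for i in range(n_res)] for s in range(n_seq)]

# Toy MSA representation: 2 sequences x 3 residues x 2 channels.
msa = [
    [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    [[1.0, 0.0], [1.0, 0.0], [0.0, 2.0]],
]
rows = row_attention(msa)
cols = column_attention(msa)
print(rows[0][1])
print(cols[0][1])
```

Row attention mixes information along a single sequence and never looks at other rows, while column attention mixes information across homologues at one position. Both calls return a grid of the same shape as the input, which is what lets Evoformer stack them.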

The broader lesson: when you have a hard combinatorial problem that seems intractable, ask what information the world has provided for free. Billions of years of evolution already solved the protein folding problem for every protein that ever lived. AlphaFold just learned to read the solution from the residue.

Atomic accuracy without hand-coding the physics. Just attention, triangles, and 50 years of biology’s most precious data.

Connections

evoformer — the Evoformer architecture introduced in this paper
attention — row/column MSA attention and triangle self-attention are the core update operations
transformer — Evoformer is a transformer variant with biologically-motivated attention patterns
protein-structure — the application domain; this paper largely defines the modern approach
self-supervised-learning — evolutionary MSA signals derived from unlabeled sequences drive the pairwise representation
attention-is-all-you-need — scaled dot-product attention is the basis for all Evoformer attention operations

Citation

Nature: 10.1038/s41586-021-03819-2

Jumper, J., Evans, R., Pritzel, A., Green, T., Figurnov, M., Ronneberger, O., Tunyasuvunakool, K., Bates, R., Žídek, A., Potapenko, A., Bridgland, A., Meyer, C., Kohl, S. A. A., Ballard, A. J., Cowie, A., Romera-Paredes, B., Nikolov, S., Jain, R., Adler, J., Back, T., …Hassabis, D. (2021). Highly accurate protein structure prediction with AlphaFold. Nature, 596, 583–589. https://doi.org/10.1038/s41586-021-03819-2
