What It Is
Inductive bias is the set of assumptions baked into a model’s architecture that constrain what functions it can represent — independently of the training data. These assumptions guide generalization: when the training data is insufficient to uniquely determine the solution, the inductive bias determines which solution the model picks. Every architecture embeds assumptions; the question is whether those assumptions match the structure of the target problem.
Why It Matters
More inductive bias means better performance with small data — the architecture already knows things it doesn’t have to learn. Less inductive bias means better performance with large data — the model has the flexibility to discover patterns the hard-coded assumptions might have excluded. ViT demonstrated this tradeoff cleanly: trained only on ImageNet (1.28M images), it underperforms comparable CNNs; pretrained on JFT-300M (300M images), it outperforms them. The architecture’s assumptions became the bottleneck once data was abundant enough to learn those assumptions directly.
How It Works
In CNNs
Convolutional layers have two structural inductive biases:
Locality: A convolutional filter can only see a local patch (e.g., 3×3 pixels) at each step. It cannot directly process the relationship between pixels 100 positions apart. Global understanding must be built by composing local features through multiple layers.
Translation equivariance: The same filter is applied at every spatial position. If a horizontal edge detector fires on the top-left of an image, the same filter detects horizontal edges everywhere. This forces the network to learn position-invariant features — appropriate because a cat in the top-left and a cat in the bottom-right are both cats.
These assumptions are correct for natural images: important features are local (edges, textures, object parts) and position-invariant (objects can appear anywhere). This is why CNNs generalize well from limited data — the architecture already encodes what’s true about the world.
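Both biases can be made concrete with a toy NumPy convolution (the 3×3 edge kernel and image sizes here are illustrative choices, not from any specific paper): the same filter is applied at every position, so shifting the input shifts the response by the same amount.

```python
import numpy as np

def conv2d_valid(img, kernel):
    """Naive 'valid' 2D convolution (cross-correlation, as in deep-learning
    conv layers): the *same* kernel slides over every spatial position."""
    kh, kw = kernel.shape
    out = np.zeros((img.shape[0] - kh + 1, img.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * kernel)
    return out

# Hypothetical Sobel-style horizontal-edge kernel (locality: each output
# value depends only on a 3x3 window of the input).
edge = np.array([[-1., -2., -1.],
                 [ 0.,  0.,  0.],
                 [ 1.,  2.,  1.]])

# An image with a horizontal edge at row 4, and the same edge moved to row 6.
img = np.zeros((12, 12))
img[4:, :] = 1.0
shifted = np.zeros((12, 12))
shifted[6:, :] = 1.0

resp = conv2d_valid(img, edge)
resp_shifted = conv2d_valid(shifted, edge)

# Translation equivariance: shifting the input down by 2 rows shifts the
# response down by 2 rows (comparing the overlapping region of the outputs).
assert np.allclose(resp[:-2], resp_shifted[2:])
```

The same filter fires on the edge wherever it appears — nothing about row 4 versus row 6 is learned or stored; it falls out of weight sharing.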
In ViT
ViT’s patch-based attention has almost no spatial inductive bias:
- Every patch can attend to every other patch from layer 1 (global receptive field from the start)
- The same relationship can be learned regardless of spatial distance
- Position information is injected only through learned position embeddings — the architecture itself doesn’t enforce any spatial structure
This means ViT must learn locality and translation structure from data. With enough data (300M images), it does — and the learned structure is richer and more flexible than the hard-coded CNN version. With limited data, it can’t.
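A minimal sketch of why the receptive field is global from layer 1 (toy sizes, random weights, single head — purely illustrative, not the full ViT forward pass): single-head attention over flattened patches produces a dense patch-to-patch weight matrix, so no pair of patches is architecturally out of reach.

```python
import numpy as np

rng = np.random.default_rng(0)

# A 16x16 "image" cut into 4x4 patches, each flattened to a 16-dim vector.
img = rng.normal(size=(16, 16))
P = 4
patches = np.array([img[i:i + P, j:j + P].ravel()
                    for i in range(0, 16, P)
                    for j in range(0, 16, P)])        # shape (16 patches, 16)

# Random (untrained) query/key projections to a d-dim space.
d = 8
W_q = rng.normal(size=(16, d))
W_k = rng.normal(size=(16, d))
Q, K = patches @ W_q, patches @ W_k

# Scaled dot-product attention scores: every patch scores every other patch,
# with no term that depends on where the two patches sit in the image.
scores = Q @ K.T / np.sqrt(d)
attn = np.exp(scores - scores.max(axis=1, keepdims=True))
attn /= attn.sum(axis=1, keepdims=True)

# Dense attention matrix: a global receptive field at layer 1.
assert attn.shape == (16, 16)
assert np.all(attn > 0)   # no patch is architecturally out of reach
```

Compare this to the convolution above: there, distance is structural (a 3×3 window simply cannot see a distant pixel); here, any preference for nearby patches would have to emerge in the learned weights and position embeddings.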
- CNN: architecture enforces locality → generalizes from ~1M images
- ViT: learns locality from data → needs 100M+ images to match a CNN
- ViT at scale: outperforms CNNs, because learned structure beats hard-coded structure
Other Common Inductive Biases
| Architecture | Inductive Bias | Correct assumption for |
|---|---|---|
| CNN | Locality, translation equivariance | Natural images |
| RNN | Sequential processing, recent tokens matter more | Short-range temporal sequences |
| Transformer | All positions are equidistant (attention) | Language, where long-range dependencies matter |
| GNN | Local graph structure, neighbor aggregation | Molecular graphs, social networks |
| SSM/Mamba | Smooth state evolution, recency | Long sequences with local structure |
The Bias-Variance Decomposition View
Inductive bias reduces model variance (constrains the hypothesis space, reducing sensitivity to training data) at the cost of potentially increasing bias (if the constraint is wrong for the problem).
Formally, for a model family H constrained by architectural bias:
- Low bias: H contains the true function — the right structure was assumed
- High bias: The architectural constraint excludes the true function — wrong assumptions
- Low variance: The constrained H generalizes reliably from less data
- High variance: Unconstrained H fits training data but generalizes poorly without sufficient data
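A small simulation makes the variance half of this concrete (the linear “world,” sample sizes, and polynomial degrees are arbitrary illustrative choices, not tied to any architecture): refitting a constrained family (lines) versus a flexible one (degree-6 polynomials) on repeated small samples shows how restricting H reduces sensitivity to the particular training set.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_f(x):
    return 2.0 * x          # the "world": a simple linear relationship

def prediction_variance(degree, x_test, n=12, noise=0.3, trials=300):
    """Refit a degree-`degree` polynomial on fresh small samples and return
    the variance of its predictions at x_test across those samples."""
    preds = []
    for _ in range(trials):
        x = rng.uniform(0, 1, n)
        y = true_f(x) + rng.normal(0, noise, n)
        coeffs = np.polyfit(x, y, degree)
        preds.append(np.polyval(coeffs, x_test))
    return np.var(preds)

# Strong (and here, correct) inductive bias: restrict H to straight lines.
var_constrained = prediction_variance(degree=1, x_test=0.95)
# Weak inductive bias: degree-6 polynomials can represent far more functions.
var_unconstrained = prediction_variance(degree=6, x_test=0.95)

# The constrained family is far less sensitive to which 12 points it saw.
assert var_unconstrained > var_constrained
```

If the world were instead highly nonlinear, the degree-1 constraint would exclude the true function — the constrained model would trade its low variance for high bias, which is exactly the “wrong assumptions” case above.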
The practical question: is the inductive bias correct for the domain? CNNs made the right bet on natural images. They make the wrong bet on graphs (where translation equivariance is meaningless) and irregular data (protein contact maps, molecular geometry).
What’s Clever
The resolution of the “CNN vs. Transformer for vision” debate through scale is a clean empirical demonstration of a theoretical concept. The “no free lunch” theorem implies that no learning algorithm is universally better — performance depends on whether the algorithm’s inductive bias matches the problem structure. ViT’s result doesn’t refute this; it demonstrates that when data is sufficient, the model can learn the correct structure from data, making hard-coded structure redundant.
Transfer learning changes the calculus: a model pretrained on 300M images has already learned locality and translation structure from data. When you fine-tune it on 10K examples, the learned structure transfers — providing the equivalent of inductive bias without hard-coding it. This is why transfer learning can substitute for inductive bias in data-limited regimes.
Key Sources
- an-image-is-worth-16x16-words — ViT; demonstrates the inductive bias vs. data scale tradeoff empirically; trains ViT at multiple scales
Related Concepts
- vision-transformer — ViT is the canonical example of trading inductive bias for scalability
- transfer-learning — pretraining can substitute for inductive bias by learning structure from large-scale data
- patch-embeddings — ViT’s patch embedding design deliberately avoids spatial inductive bias
- scaling-laws — scaling laws operate in the low-bias, data-rich regime where learned structure dominates
- grokking — grokking can be viewed as the model discovering the correct inductive bias (underlying algorithm) late in training
Open Questions
- Can architectures learn their own inductive biases from data, rather than having them fixed by design? (Neural Architecture Search is a partial answer)
- Do the correct inductive biases for language and vision converge at scale, or does domain-specific structure remain important?
- How do you measure the degree of inductive bias in an architecture quantitatively?