What It Is

Inductive bias is the set of assumptions baked into a model’s architecture that constrain what functions it can represent — independently of the training data. These assumptions guide generalization: when the training data is insufficient to uniquely determine the solution, the inductive bias determines which solution the model picks. Every architecture embeds assumptions; the question is whether those assumptions match the structure of the target problem.

Why It Matters

More inductive bias means better performance with small data — the architecture already knows things it doesn’t have to learn. Less inductive bias means better performance with large data — the model has the flexibility to discover patterns the hard-coded assumptions might have excluded. ViT’s contribution was demonstrating this tradeoff precisely: trained on ImageNet (1.28M images), it underperforms CNNs; trained on JFT-300M (300M images), it outperforms them. The architecture’s assumptions became the bottleneck when data was abundant enough to learn those assumptions directly.

How It Works

In CNNs

Convolutional layers have two structural inductive biases:

Locality: A convolutional filter can only see a local patch (e.g., 3×3 pixels) at each step. It cannot directly process the relationship between pixels 100 positions apart. Global understanding must be built by composing local features through multiple layers.

Translation equivariance: The same filter is applied at every spatial position. If a horizontal edge detector fires on the top-left of an image, the same filter detects horizontal edges everywhere. Shifting the input shifts the feature map by the same amount (equivariance); pooling then converts this into approximate translation invariance. This is appropriate because a cat in the top-left and a cat in the bottom-right are both cats.

These assumptions are correct for natural images: important features are local (edges, textures, object parts) and position-invariant (objects can appear anywhere). This is why CNNs generalize well from limited data — the architecture already encodes what’s true about the world.
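The weight-sharing bias can be seen in a toy 1-D convolution — a hypothetical minimal sketch, not how a real CNN is implemented (real CNNs use learned 2-D filters): the same kernel slides over every position, so shifting the input shifts the response by the same amount.

```python
# Toy 1-D convolution illustrating translation equivariance.
# (Illustrative sketch; real CNNs use learned 2-D filters and many channels.)

def conv1d(signal, kernel):
    """Valid-mode 1-D convolution: the SAME kernel is applied at every position."""
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]

edge_kernel = [-1, 1]                # responds to a step up in intensity
signal      = [0, 0, 0, 5, 5, 5, 0]  # a "step edge" at index 3

out     = conv1d(signal, edge_kernel)
shifted = conv1d([0] + signal[:-1], edge_kernel)  # input shifted right by 1

print(out)      # [0, 0, 5, 0, 0, -5] — the filter fires exactly at the edges
print(shifted)  # the same response pattern, shifted right by one position
```

Because the kernel is shared across positions, the network never has to relearn "edge" separately for each location — exactly the assumption that pays off when data is limited.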

In ViT

ViT’s patch-based attention has almost no spatial inductive bias:

  • Every patch can attend to every other patch from layer 1 (global receptive field from the start)
  • The same relationship can be learned regardless of spatial distance
  • Position information is injected only through learned position embeddings — the architecture itself doesn’t enforce any spatial structure

This means ViT must learn locality and translation structure from data. With enough data (300M images), it does — and the learned structure is richer and more flexible than the hard-coded CNN version. With limited data, it can’t.
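The attention step above can be sketched in a few lines — a minimal single-head toy with made-up patch embeddings (a real ViT adds learned Q/K/V projections, multiple heads, and position embeddings): each patch's output is a weighted mix of all patches, with no locality constraint anywhere in the computation.

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def self_attention(patches):
    """Each patch attends to EVERY patch — global receptive field at layer 1."""
    d = len(patches[0])
    out = []
    for q in patches:  # query patch
        # Scaled dot-product similarity against ALL patches, near or far.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in patches]
        weights = softmax(scores)
        # Output = attention-weighted average of every patch embedding.
        out.append([sum(w * v[j] for w, v in zip(weights, patches))
                    for j in range(d)])
    return out

# Four toy 2-D patch embeddings (illustrative values).
patches = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
result = self_attention(patches)
# Every output row mixes all 4 patches — spatial distance plays no role.
```

Note that nothing in `self_attention` knows which patches are spatial neighbors; any preference for nearby patches must be learned, via the position embeddings, from data.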

CNN:          architecture enforces locality  → generalizes from ~1M images
ViT:          learns locality from data       → needs 100M+ images to match CNNs
ViT at scale: learned structure > hard-coded  → outperforms CNNs once data is abundant

Other Common Inductive Biases

| Architecture | Inductive bias | Correct assumption for |
| --- | --- | --- |
| CNN | Locality, translation equivariance | Natural images |
| RNN | Sequential processing; recent tokens matter more | Short-range temporal sequences |
| Transformer | All positions are equidistant (attention) | Language, where long-range dependencies matter |
| GNN | Local graph structure, neighbor aggregation | Molecular graphs, social networks |
| SSM/Mamba | Smooth state evolution, recency | Long sequences with local structure |

The Bias-Variance Decomposition View

Inductive bias reduces model variance (constrains the hypothesis space, reducing sensitivity to training data) at the cost of potentially increasing bias (if the constraint is wrong for the problem).

Formally, for a model family H constrained by architectural bias:

  • Low bias: H contains the true function — the right structure was assumed
  • High bias: The architectural constraint excludes the true function — wrong assumptions
  • Low variance: The constrained H generalizes reliably from less data
  • High variance: Unconstrained H fits training data but generalizes poorly without sufficient data
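In symbols, this is the standard decomposition of expected squared error for an estimator f̂ of a true function f, with irreducible noise variance σ²:

```latex
\mathbb{E}\!\left[(y - \hat{f}(x))^2\right]
= \underbrace{\left(f(x) - \mathbb{E}[\hat{f}(x)]\right)^2}_{\text{bias}^2}
+ \underbrace{\mathbb{E}\!\left[\left(\hat{f}(x) - \mathbb{E}[\hat{f}(x)]\right)^2\right]}_{\text{variance}}
+ \underbrace{\sigma^2}_{\text{noise}}
```

Architectural constraints shrink the hypothesis space, which shrinks the variance term; if the constraint excludes the true f, the bias term grows instead.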

The practical question: is the inductive bias correct for the domain? CNNs made the right bet on natural images. They make the wrong bet on graphs (where translation equivariance is meaningless) and irregular data (protein contact maps, molecular geometry).

What’s Clever

The resolution of the “CNN vs. Transformer for vision” debate through scale is a clean empirical demonstration of a theoretical concept. The “no free lunch” theorem implies that no learning algorithm is universally better — performance depends on whether the algorithm’s inductive bias matches the problem structure. ViT’s result doesn’t refute this; it demonstrates that when data is sufficient, the model can learn the correct structure from data, making hard-coded structure redundant.

Transfer learning changes the calculus: a model pretrained on 300M images has already learned locality and translation structure from data. When you fine-tune it on 10K examples, the learned structure transfers — providing the equivalent of inductive bias without hard-coding it. This is why transfer learning can substitute for inductive bias in data-limited regimes.
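A toy sketch of this substitution — all names and numbers are illustrative, not any real ViT or fine-tuning API: freeze a "pretrained" feature map and fit only a small head on a handful of labeled examples.

```python
# Hypothetical sketch: frozen pretrained features + a tiny fine-tuned head.

def pretrained_features(x):
    """Stands in for a backbone that already learned useful structure.
    It is FROZEN: never updated during fine-tuning."""
    return [x, x * x]

# Small labeled set (y = x + x^2); only the linear head w is trained.
data = [(0.0, 0.0), (1.0, 2.0), (2.0, 6.0), (3.0, 12.0)]
w, lr = [0.0, 0.0], 0.01

for _ in range(2000):                       # plain SGD on the head only
    for x, y in data:
        f = pretrained_features(x)
        err = sum(wi * fi for wi, fi in zip(w, f)) - y
        w = [wi - lr * err * fi for wi, fi in zip(w, f)]

print(w)  # ≈ [1.0, 1.0]: the head recovers y = x + x^2 from 4 examples
```

Because the hard part (the features) was learned upstream from abundant data, the downstream fit needs only a few examples — the learned structure plays the role an inductive bias would otherwise play.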

Key Sources

  • vision-transformer — ViT is the canonical example of trading inductive bias for scalability
  • transfer-learning — pretraining can substitute for inductive bias by learning structure from large-scale data
  • patch-embeddings — ViT’s patch embedding design deliberately avoids spatial inductive bias
  • scaling-laws — scaling laws operate in the low-bias, data-rich regime where learned structure dominates
  • grokking — grokking can be viewed as the model discovering the correct inductive bias (underlying algorithm) late in training

Open Questions

  • Can architectures learn their own inductive biases from data, rather than having them fixed by design? (Neural Architecture Search is a partial answer)
  • Do the correct inductive biases for language and vision converge at scale, or does domain-specific structure remain important?
  • How do you measure the degree of inductive bias in an architecture quantitatively?