What It Is

Inductive bias is the set of assumptions baked into a model’s architecture that constrain what functions it can represent — independently of the training data. These assumptions guide generalization: when the training data is insufficient to uniquely determine the solution, the inductive bias determines which solution the model picks. Every architecture embeds assumptions; the question is whether those assumptions match the structure of the target problem.

Why It Matters

More inductive bias means better performance with small data — the architecture already knows things it doesn’t have to learn. Less inductive bias means better performance with large data — the model has the flexibility to discover patterns the hard-coded assumptions might have excluded. ViT’s contribution was demonstrating this tradeoff precisely: trained on ImageNet (1.28M images), it underperforms CNNs; trained on JFT-300M (300M images), it outperforms them. The architecture’s assumptions became the bottleneck when data was abundant enough to learn those assumptions directly.

How It Works

In CNNs

Convolutional layers have two structural inductive biases:

Locality: A convolutional filter can only see a local patch (e.g., 3×3 pixels) at each step. It cannot directly process the relationship between pixels 100 positions apart. Global understanding must be built by composing local features through multiple layers.

Translation equivariance: The same filter is applied at every spatial position. If a horizontal edge detector fires on the top-left of an image, the same filter detects horizontal edges everywhere. Shifting the input shifts the feature map by the same amount (equivariance); pooling then converts this into approximate translation invariance. This is appropriate because a cat in the top-left and a cat in the bottom-right are both cats.

These assumptions are correct for natural images: important features are local (edges, textures, object parts) and position-invariant (objects can appear anywhere). This is why CNNs generalize well from limited data — the architecture already encodes what’s true about the world.
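The weight-sharing bias can be seen in a toy 1-D convolution — a hypothetical minimal sketch, not how a real CNN is implemented (real CNNs use learned 2-D filters): the same kernel slides over every position, so shifting the input shifts the response by the same amount.

```python
# Toy 1-D convolution illustrating translation equivariance.
# (Illustrative sketch; real CNNs use learned 2-D filters and many channels.)

def conv1d(signal, kernel):
    """Valid-mode 1-D convolution: the SAME kernel is applied at every position."""
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]

edge_kernel = [-1, 1]                # responds to a step up in intensity
signal      = [0, 0, 0, 5, 5, 5, 0]  # a "step edge" at index 3

out     = conv1d(signal, edge_kernel)
shifted = conv1d([0] + signal[:-1], edge_kernel)  # input shifted right by 1

print(out)      # [0, 0, 5, 0, 0, -5] — the filter fires exactly at the edges
print(shifted)  # the same response pattern, shifted right by one position
```

Because the kernel is shared across positions, the network never has to relearn "edge" separately for each location — exactly the assumption that pays off when data is limited.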

In ViT

ViT’s patch-based attention has almost no spatial inductive bias:

  • Every patch can attend to every other patch from layer 1 (global receptive field from the start)
  • The same relationship can be learned regardless of spatial distance
  • Position information is injected only through learned position embeddings — the architecture itself doesn’t enforce any spatial structure

This means ViT must learn locality and translation structure from data. With enough data (300M images), it does — and the learned structure is richer and more flexible than the hard-coded CNN version. With limited data, it can’t.
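The attention step above can be sketched in a few lines — a minimal single-head toy with made-up patch embeddings (a real ViT adds learned Q/K/V projections, multiple heads, and position embeddings): each patch's output is a weighted mix of all patches, with no locality constraint anywhere in the computation.

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def self_attention(patches):
    """Each patch attends to EVERY patch — global receptive field at layer 1."""
    d = len(patches[0])
    out = []
    for q in patches:  # query patch
        # Scaled dot-product similarity against ALL patches, near or far.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in patches]
        weights = softmax(scores)
        # Output = attention-weighted average of every patch embedding.
        out.append([sum(w * v[j] for w, v in zip(weights, patches))
                    for j in range(d)])
    return out

# Four toy 2-D patch embeddings (illustrative values).
patches = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
result = self_attention(patches)
# Every output row mixes all 4 patches — spatial distance plays no role.
```

Note that nothing in `self_attention` knows which patches are spatial neighbors; any preference for nearby patches must be learned, via the position embeddings, from data.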

CNN:          architecture enforces locality  → generalizes from ~1M images
ViT:          learns locality from data       → needs 100M+ images to match CNNs
ViT at scale: learned structure > hard-coded  → outperforms CNNs once data is abundant

Other Common Inductive Biases

| Architecture | Inductive bias | Correct assumption for |
| --- | --- | --- |
| CNN | Locality, translation equivariance | Natural images |
| RNN | Sequential processing; recent tokens matter more | Short-range temporal sequences |
| Transformer | All positions are equidistant (attention) | Language, where long-range dependencies matter |
| GNN | Local graph structure, neighbor aggregation | Molecular graphs, social networks |
| SSM/Mamba | Smooth state evolution, recency | Long sequences with local structure |

The Bias-Variance Decomposition View

Inductive bias reduces model variance (constrains the hypothesis space, reducing sensitivity to training data) at the cost of potentially increasing bias (if the constraint is wrong for the problem).

Formally, for a model family H constrained by architectural bias:

  • Low bias: H contains the true function — the right structure was assumed
  • High bias: The architectural constraint excludes the true function — wrong assumptions
  • Low variance: The constrained H generalizes reliably from less data
  • High variance: Unconstrained H fits training data but generalizes poorly without sufficient data
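In symbols, this is the standard decomposition of expected squared error for an estimator f̂ of a true function f, with irreducible noise variance σ²:

```latex
\mathbb{E}\!\left[(y - \hat{f}(x))^2\right]
= \underbrace{\left(f(x) - \mathbb{E}[\hat{f}(x)]\right)^2}_{\text{bias}^2}
+ \underbrace{\mathbb{E}\!\left[\left(\hat{f}(x) - \mathbb{E}[\hat{f}(x)]\right)^2\right]}_{\text{variance}}
+ \underbrace{\sigma^2}_{\text{noise}}
```

Architectural constraints shrink the hypothesis space, which shrinks the variance term; if the constraint excludes the true f, the bias term grows instead.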

The practical question: is the inductive bias correct for the domain? CNNs made the right bet on natural images. They make the wrong bet on graphs (where translation equivariance is meaningless) and irregular data (protein contact maps, molecular geometry).

What’s Clever

The resolution of the “CNN vs. Transformer for vision” debate through scale is a clean empirical demonstration of a theoretical concept. The “no free lunch” theorem implies that no learning algorithm is universally better — performance depends on whether the algorithm’s inductive bias matches the problem structure. ViT’s result doesn’t refute this; it demonstrates that when data is sufficient, the model can learn the correct structure from data, making hard-coded structure redundant.

Transfer learning changes the calculus: a model pretrained on 300M images has already learned locality and translation structure from data. When you fine-tune it on 10K examples, the learned structure transfers — providing the equivalent of inductive bias without hard-coding it. This is why transfer learning can substitute for inductive bias in data-limited regimes.
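A toy sketch of this substitution — all names and numbers are illustrative, not any real ViT or fine-tuning API: freeze a "pretrained" feature map and fit only a small head on a handful of labeled examples.

```python
# Hypothetical sketch: frozen pretrained features + a tiny fine-tuned head.

def pretrained_features(x):
    """Stands in for a backbone that already learned useful structure.
    It is FROZEN: never updated during fine-tuning."""
    return [x, x * x]

# Small labeled set (y = x + x^2); only the linear head w is trained.
data = [(0.0, 0.0), (1.0, 2.0), (2.0, 6.0), (3.0, 12.0)]
w, lr = [0.0, 0.0], 0.01

for _ in range(2000):                       # plain SGD on the head only
    for x, y in data:
        f = pretrained_features(x)
        err = sum(wi * fi for wi, fi in zip(w, f)) - y
        w = [wi - lr * err * fi for wi, fi in zip(w, f)]

print(w)  # ≈ [1.0, 1.0]: the head recovers y = x + x^2 from 4 examples
```

Because the hard part (the features) was learned upstream from abundant data, the downstream fit needs only a few examples — the learned structure plays the role an inductive bias would otherwise play.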

Key Sources

  • vision-transformer — ViT is the canonical example of trading inductive bias for scalability
  • transfer-learning — pretraining can substitute for inductive bias by learning structure from large-scale data
  • patch-embeddings — ViT’s patch embedding design deliberately avoids spatial inductive bias
  • scaling-laws — scaling laws operate in the low-bias, data-rich regime where learned structure dominates
  • grokking — grokking can be viewed as the model discovering the correct inductive bias (underlying algorithm) late in training

Open Questions

  • Can architectures learn their own inductive biases from data, rather than having them fixed by design? (Neural Architecture Search is a partial answer)
  • Do the correct inductive biases for language and vision converge at scale, or does domain-specific structure remain important?
  • How do you measure the degree of inductive bias in an architecture quantitatively?