What It Is

Mixture of Experts (MoE) is a neural network architecture where only a subset of the model’s parameters (“experts”) activates for each input. A learned routing function (the “gate”) selects which experts to use for each token. The total parameter count can be very large (all experts combined), but the compute per token remains constant (only selected experts execute). MoE decouples model capacity from computational cost.

Why It Matters

Dense models couple capacity and compute directly: double the parameters, double the compute per token. MoE breaks this coupling. You can train a model with 10-100× more parameters than a dense model at the same FLOP budget, because most of those parameters are idle at any given moment. This is the architecture behind many of the largest models deployed today: GPT-4 (reportedly MoE), Mixtral, Gemini, DeepSeek-V3, and Switch Transformer.

The Core Mechanism

A standard Transformer layer has an attention sublayer followed by a feedforward network (FFN). MoE replaces the single FFN with N separate “expert” FFNs and a router:

DENSE TRANSFORMER LAYER:
  Input x
    |
  [Multi-Head Attention]
    |
  [Single FFN: d_model → 4d_model → d_model]
    |
  Output

MoE TRANSFORMER LAYER:
  Input x (one token embedding)
    |
  [Multi-Head Attention]
    |
  [Router: Linear(d_model → N_experts) + softmax]
    → selects top-k experts (k=1 or k=2)
    |
  Token routed to selected experts only:
    Expert 3 FFN: d_model → 4d_model → d_model
    Expert 7 FFN: d_model → 4d_model → d_model  (if k=2)
    |
  Weighted combination: Σ gate_weight_i × expert_i_output
    |
  Output

Parameter count vs. active parameters:

  • Dense 7B model: 7B parameters, 7B active per token
  • MoE model with 8 experts, top-2 routing (e.g. Mixtral 8x7B): ~47B total parameters, ~13B active per token
  • Result: quality approaching a 47B-parameter dense model at roughly 13B-parameter compute cost
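The parameter arithmetic above can be sketched directly. A back-of-envelope count, assuming illustrative Mixtral-8x7B-like dimensions (the d_model, expert hidden size, and layer count below are assumptions, and attention/embedding parameters are ignored):

```python
# Back-of-envelope parameter math for an 8-expert, top-2 MoE.
# Dimensions are illustrative, loosely Mixtral-8x7B-shaped (assumption).
d_model = 4096
d_ff = 14336            # expert hidden size (assumption)
n_layers = 32
n_experts = 8
top_k = 2

# A SwiGLU-style expert has 3 projections of size d_model x d_ff (assumption)
params_per_expert = 3 * d_model * d_ff

total_moe = n_layers * n_experts * params_per_expert   # all experts, all layers
active_moe = n_layers * top_k * params_per_expert      # only routed experts

print(f"total expert params:  {total_moe / 1e9:.1f}B")   # ~45B
print(f"active expert params: {active_moe / 1e9:.1f}B")  # ~11B
```

The expert FFNs alone account for ~45B total / ~11B active parameters; attention, embeddings, and the routers (shared by every token) make up the remainder of the headline ~47B / ~13B figures.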

Router Design

The router is a learned linear projection from token embedding to expert scores, followed by softmax:

h = softmax(W_r · x)
selected_experts = top_k(h)

Token output = Σ_{i in selected} h[i] × Expert_i(x)

The routing probability h[i] serves as both the selection criterion and the output weight. This weighting provides gradient signal for learning which tokens should go to which experts.
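A minimal sketch of this routing rule for a single token, using NumPy and toy linear "experts" (all names, shapes, and the random toy weights are illustrative, not from any particular implementation):

```python
import numpy as np

def moe_forward(x, W_r, experts, k=2):
    """Route one token embedding x through the top-k of N expert functions.

    x: (d_model,), W_r: (n_experts, d_model),
    experts: list of callables mapping (d_model,) -> (d_model,).
    """
    logits = W_r @ x
    h = np.exp(logits - logits.max())
    h = h / h.sum()                   # softmax over expert scores
    top = np.argsort(h)[-k:]          # indices of the k largest gates
    # Weighted combination: sum_i h[i] * Expert_i(x), i in selected
    return sum(h[i] * experts[i](x) for i in top)

rng = np.random.default_rng(0)
d, n = 8, 4
x = rng.normal(size=d)
W_r = rng.normal(size=(n, d))
# Toy "experts": fixed random linear maps standing in for FFNs
Ws = [rng.normal(size=(d, d)) for _ in range(n)]
experts = [lambda v, W=W: W @ v for W in Ws]

y = moe_forward(x, W_r, experts, k=2)
print(y.shape)  # same shape as the input embedding
```

Note that some production systems (Mixtral among them) renormalize the selected top-k gate weights to sum to 1; the sketch uses the raw softmax weights, matching the formula above.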

Top-1 vs. Top-2 routing:

  • Top-1 (Switch Transformer): simplest, least compute, surprisingly competitive
  • Top-2 (Mixtral, most production systems): smoother gradients, slightly better quality, more memory bandwidth

Load Balancing

Without regularization, all tokens converge to a few popular experts (collapse). This wastes most of the MoE’s capacity.

Auxiliary load-balancing loss (Switch Transformer):

L_aux = α × N × Σ_i (f_i × P_i)

where f_i = fraction of tokens routed to expert i, P_i = mean routing probability for expert i, and α is a small scaling coefficient (10⁻² in the Switch Transformer paper). The sum is minimized when routing is uniform across experts, so this term penalizes uneven routing.
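A sketch of this loss for top-1 routing, assuming NumPy (the small coefficient that typically scales this term in the total loss is omitted):

```python
import numpy as np

def load_balancing_loss(router_probs, expert_index):
    """Switch-style auxiliary loss: N * sum_i f_i * P_i.

    router_probs: (tokens, n_experts) softmax outputs of the router.
    expert_index: (tokens,) top-1 expert chosen per token.
    """
    n_experts = router_probs.shape[1]
    # f_i: fraction of tokens dispatched to expert i
    f = np.bincount(expert_index, minlength=n_experts) / len(expert_index)
    # P_i: mean routing probability mass on expert i
    P = router_probs.mean(axis=0)
    return n_experts * float(np.sum(f * P))

# Perfectly uniform routing gives the minimum value, 1.0 ...
probs = np.full((8, 4), 0.25)
idx = np.array([0, 1, 2, 3, 0, 1, 2, 3])
balanced = load_balancing_loss(probs, idx)

# ... while full collapse onto one expert gives N (here 4.0)
collapsed = load_balancing_loss(np.eye(4)[np.zeros(8, dtype=int)],
                                np.zeros(8, dtype=int))
print(balanced, collapsed)
```

The gradient flows through P_i (the soft probabilities), since f_i is a non-differentiable count.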

Expert capacity buffer: each expert can process at most C = (tokens_per_batch / N_experts) × capacity_factor tokens. Tokens routed to a full expert are dropped (passed through unchanged). This enforces balanced routing via hard constraints during training.
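The capacity formula as a one-liner (the capacity_factor default below is an assumption; published values typically fall between 1.0 and 2.0):

```python
import math

def expert_capacity(tokens_per_batch, n_experts, capacity_factor=1.25):
    """Max tokens each expert may process; overflow tokens are dropped."""
    return math.ceil(tokens_per_batch / n_experts * capacity_factor)

# e.g. 4096 tokens over 8 experts with 25% slack -> 640 tokens per expert
print(expert_capacity(4096, 8))
```

A capacity_factor of 1.0 allows no slack, so any routing imbalance causes drops; larger values waste compute on padding but drop fewer tokens.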

Specialization

A key empirical observation: experts tend to specialize. Different experts handle different token types, domains, or linguistic patterns. This specialization is not designed — it emerges from training. In multilingual models, distinct experts handle different languages. In code models, experts specialize by programming language.

This specialization is what gives MoE models their quality advantage: each token is processed by specialists, not generalists.

Training Stability

MoE models are harder to train than dense models:

  • Router collapse (all tokens to one expert) is common early in training
  • Low-precision training can destabilize routing; keeping router logits in float32 while the rest of the network runs in bfloat16 is standard practice
  • Expert dropout during training (randomly dropping entire experts) prevents co-adaptation and improves robustness

Deployment Considerations

Communication overhead: in distributed training and inference, experts are distributed across devices. A token routed to an expert on a different GPU requires inter-device communication — this “all-to-all” communication is a significant overhead for large MoE models.

Memory: even though only a few experts activate per token, all experts must be loaded into memory. A 47B MoE model requires all 47B parameters in memory, not the ~13B that are active.
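A rough estimate of that serving footprint, assuming 2-byte (bf16) weights:

```python
# Serving memory scales with TOTAL parameters, not active ones.
total_params = 47e9
bytes_per_param = 2          # bf16/fp16 weights (assumption; fp32 doubles this)
gib = total_params * bytes_per_param / 2**30
print(f"{gib:.0f} GiB")      # weights alone, before KV cache and activations
```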

Fine-tuning difficulty: experts specialize during pretraining. Fine-tuning on a narrow domain may over-activate domain-relevant experts while others atrophy. This can hurt general capability.

Key Sources

  • transformer — MoE is applied within Transformer layers, replacing the FFN
  • scaling-laws — MoE changes scaling dynamics: more capacity at the same compute budget
  • inference-efficiency — MoE reduces active parameters per token, improving throughput at the cost of model size

Open Questions

  • Optimal number of experts and routing k for different model scales?
  • Can expert specialization be controlled or steered deliberately?
  • How to fine-tune MoE models without destroying pretraining specialization?
  • Is top-1 routing consistently competitive with top-2, or does the advantage depend on scale and task?