Summary

Jiang et al. (2024) introduce Mixtral 8x7B, a Sparse Mixture of Experts (SMoE) language model that achieves strong performance while activating only a fraction of its total parameters per token. Each transformer layer replaces the single feed-forward network (FFN) with 8 parallel expert FFNs; a router network selects the top-2 experts per token per layer. Although the model has 47B total parameters, only ~13B are active during any given forward pass, giving inference costs comparable to a 13B dense model while retaining the representational capacity of a much larger one. Mixtral outperforms or matches Llama 2 70B and GPT-3.5 on most evaluated benchmarks, with roughly 6× faster inference than the dense 70B model.

The paper also introduces Mixtral 8x7B-Instruct, fine-tuned with supervised instruction following and direct preference optimization (DPO), which surpasses GPT-3.5 Turbo, Claude-2.1, and Gemini Pro on human preference benchmarks. Both models are released under the Apache 2.0 license. The architecture otherwise follows Mistral 7B, making the MoE FFN substitution the central architectural change and isolating its contribution. Mixtral demonstrated that sparse MoE was practically viable at open-weight scale and triggered widespread adoption of MoE architectures in subsequent models (Mixtral 8x22B, DeepSeek-MoE, etc.).

Key Claims

  • Mixtral 8x7B has 47B total parameters but uses only 13B active parameters per token, achieving dense-13B inference costs.
  • Outperforms Llama 2 70B on HellaSwag (89.2 vs 87.6), Arc-Challenge (66.0 vs 61.3), and WinoGrande (81.2 vs 80.2).
  • Significantly outperforms Llama 2 70B on GSM8K math (74.4 vs 56.8) and HumanEval code (40.2 vs 29.9).
  • Mixtral 8x7B-Instruct surpasses GPT-3.5 Turbo and Claude-2.1 on MT-Bench and human preference evaluations.
  • Context window of 32,768 tokens, matching or exceeding contemporary open-weight models.
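The 47B-total / 13B-active claim can be sanity-checked with rough parameter arithmetic. The sketch below uses the published architecture dimensions (d_model 4096, SwiGLU FFN hidden size 14336, 32 layers, 8 experts, GQA with 8 KV heads, 32k vocabulary); the per-component formulas are approximations that ignore norms and biases, so the totals are ballpark figures, not exact counts:

```python
# Rough parameter accounting for Mixtral 8x7B (approximate; norms/biases ignored).
d_model, d_ff, n_layers = 4096, 14336, 32
n_experts, top_k, vocab = 8, 2, 32000
kv_dim = 8 * 128  # 8 KV heads of dim 128 (grouped-query attention)

expert = 3 * d_model * d_ff                       # SwiGLU FFN: gate, up, down projections
attn = 2 * d_model * d_model + 2 * d_model * kv_dim  # q, o full-size; k, v reduced (GQA)
router = d_model * n_experts                      # one logit per expert (negligible)

per_layer_total = n_experts * expert + attn + router
per_layer_active = top_k * expert + attn + router  # only 2 experts run per token
embeddings = 2 * vocab * d_model                  # input + output embedding matrices

total = n_layers * per_layer_total + embeddings
active = n_layers * per_layer_active + embeddings
print(f"total ≈ {total/1e9:.1f}B, active ≈ {active/1e9:.1f}B")  # ≈ 46.7B and 12.9B
```

The arithmetic lands within rounding distance of the paper's headline figures of 47B total and 13B active parameters per token.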

Methods

Mixtral uses the Mistral 7B Transformer architecture (decoder-only, grouped-query attention, sliding window attention, RoPE) with a single modification: each FFN sublayer is replaced by a mixture of 8 expert FFNs. The router is a linear layer that produces 8 logits per token; the two largest are kept via a top-k operation, and the corresponding experts' outputs are combined with weights obtained by applying softmax over just those two selected logits. Specifically, for token x at a given layer: y = Σ_{i ∈ Top2(G(x))} softmax(Top2(G(x)))_i · FFN_i(x). Because only 2 of 8 experts activate per token, experts specialize; the paper's routing analysis finds specialization along syntactic lines (e.g., consistent experts for certain token types) rather than by topic or domain. Training uses expert parallelism across 8 GPUs per group. The 32k context window is supported by RoPE with extended base frequency.
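The routing rule above can be sketched for a single token in a few lines of NumPy. This is a minimal illustration, not Mixtral's implementation: the experts here are stand-in linear maps rather than SwiGLU FFNs, and the dimensions are toy-sized:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts, top_k = 16, 8, 2

# Router: a single linear layer producing one logit per expert.
W_g = rng.normal(size=(d_model, n_experts))
# Stand-in "experts": plain linear maps in place of SwiGLU FFNs.
experts = [rng.normal(size=(d_model, d_model)) for _ in range(n_experts)]

def moe_layer(x):
    """Top-2 sparse MoE layer for a single token vector x of shape (d_model,)."""
    logits = x @ W_g                        # (n_experts,) router logits G(x)
    top2 = np.argsort(logits)[-top_k:]      # indices of the 2 largest logits
    # Softmax over only the two selected logits, as described above.
    z = np.exp(logits[top2] - logits[top2].max())
    weights = z / z.sum()
    # Weighted combination of the two active experts' outputs;
    # the other 6 experts are never evaluated for this token.
    return sum(w * (x @ experts[i]) for w, i in zip(weights, top2))

y = moe_layer(rng.normal(size=d_model))     # y has shape (d_model,)
```

Only 2 of the 8 expert matmuls run per token, which is exactly where the dense-13B compute cost comes from despite 47B stored parameters.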

Failure modes

  • Load imbalance: without auxiliary losses, some experts can be overloaded while others are rarely used, degrading training efficiency and expert specialization.
  • MoE models are harder to serve than dense models: all 47B parameters must reside in memory (or be distributed) even though only 13B activate per token.
  • Top-2 routing is a fixed architectural choice; it is not clear that 2 is optimal for all tasks or domains.
  • Performance on multilingual tasks, while better than Llama 2 70B, is still weaker than models explicitly trained on diverse multilingual data.
  • Expert routing decisions are not interpretable; it is difficult to diagnose or control which experts activate for a given input.
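The load-imbalance failure mode is typically mitigated with an auxiliary balancing loss added to the training objective. The Mixtral paper does not spell out its balancing recipe, so the sketch below follows the Switch Transformer-style formulation as an illustrative assumption: penalize the dot product between the fraction of tokens routed to each expert and the mean router probability for that expert, scaled so that perfectly uniform routing gives a loss of 1.0:

```python
import numpy as np

def load_balancing_loss(router_logits, top_k=2):
    """Switch-Transformer-style auxiliary loss (illustrative, not Mixtral's exact recipe).

    router_logits: array of shape (n_tokens, n_experts).
    Returns a scalar that is minimized (value 1.0) under uniform routing.
    """
    n_tokens, n_experts = router_logits.shape
    # Softmax over experts for each token.
    probs = np.exp(router_logits - router_logits.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)
    # f_i: fraction of token-to-expert assignments that land on expert i.
    topk_idx = np.argsort(router_logits, axis=1)[:, -top_k:]
    f = np.bincount(topk_idx.ravel(), minlength=n_experts) / (n_tokens * top_k)
    # P_i: mean router probability mass assigned to expert i.
    P = probs.mean(axis=0)
    return n_experts * float(f @ P)

# Collapsed routing (every token strongly prefers experts 0 and 1) is penalized
# far more than spread-out routing.
skewed = np.zeros((64, 8)); skewed[:, :2] = 10.0
spread = np.random.default_rng(0).normal(size=(64, 8))
print(load_balancing_loss(skewed), ">", load_balancing_loss(spread))
```

Because both f and P concentrate on the same overloaded experts, the loss grows quadratically with imbalance, nudging the router back toward uniform expert utilization.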

Connections

Citation

arXiv:2401.04088

Jiang, A. Q., Sablayrolles, A., Roux, A., Mensch, A., Savary, B., Bamford, C., Chaplot, D. S., de las Casas, D., Hanna, E. B., Bressand, F., Lengyel, G., Bour, G., Lample, G., Lavaud, L. R., Saulnier, L., Lachaux, M.-A., Stock, P., Subramanian, S., Yang, S., … El Sayed, W. (2024). Mixtral of Experts. arXiv preprint. https://arxiv.org/abs/2401.04088