Summary

Touvron et al. (2023) introduce LLaMA, a family of foundation language models ranging from 7B to 65B parameters trained exclusively on publicly available data. The core argument inverts the Kaplan scaling law prescription: rather than training the largest model until compute budget is exhausted, LLaMA trains smaller models on far more tokens than Chinchilla-optimal — trading training compute for inference efficiency. A model that is cheaper to serve at inference time may be more practical even if it costs more to train. LLaMA-13B outperforms GPT-3 (175B) on most benchmarks, and LLaMA-65B is competitive with Chinchilla-70B and PaLM-540B.
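The train-vs-serve tradeoff can be made concrete with standard back-of-envelope approximations (training ≈ 6·N·D FLOPs for N parameters and D tokens; inference ≈ 2·N FLOPs per generated token). The model pairing and lifetime-serving volume below are illustrative assumptions, not figures from the paper:

```python
# Back-of-envelope FLOP accounting for the train-vs-serve tradeoff.
# Approximations: training ~ 6*N*D FLOPs, inference ~ 2*N FLOPs/token.

def training_flops(n_params, n_tokens):
    return 6 * n_params * n_tokens

def inference_flops_per_token(n_params):
    return 2 * n_params

# Two models treated as rough quality peers for the sake of the arithmetic:
# a large Chinchilla-style run vs a smaller model trained on more tokens
# than its own Chinchilla-optimal budget.
train_big = training_flops(70e9, 1.4e12)    # Chinchilla-70B on 1.4T tokens
train_small = training_flops(13e9, 1.0e12)  # LLaMA-13B on 1.0T tokens

served = 1e12  # assumed lifetime tokens served
total_big = train_big + inference_flops_per_token(70e9) * served
total_small = train_small + inference_flops_per_token(13e9) * served
print(total_big / total_small)  # ~7x under these assumptions
```

The smaller model is cheaper to train here, but the point generalizes: as the served-token count grows, the 2·N per-token term dominates and the inference-efficient model wins regardless of training cost.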

LLaMA’s significance was amplified by its public release to the research community. Within weeks, the weights were leaked and formed the basis for an explosion of fine-tuned variants (Alpaca, Vicuna, WizardLM) and subsequent open-weight series (LLaMA 2, Mistral, Mixtral). The paper demonstrated that state-of-the-art capabilities were achievable without proprietary data, making large-scale LLM research broadly accessible. The architectural choices — RMSNorm, SwiGLU activations, RoPE positional embeddings — became the default template for most subsequent open-weight models.

Key Claims

  • LLaMA-13B outperforms GPT-3 (175B) on most standard NLP benchmarks despite being 13× smaller in parameter count.
  • LLaMA-65B is competitive with Chinchilla-70B (which uses equivalent compute) and PaLM-540B.
  • All models are trained exclusively on publicly available data: CommonCrawl, C4, GitHub, Wikipedia, Books, ArXiv, StackExchange (1.4T tokens for the 33B/65B models, 1.0T for 7B/13B).
  • LLaMA-7B, trained on 1T tokens, is already competitive with GPT-3 on common-sense reasoning benchmarks (BoolQ, PIQA, HellaSwag, WinoGrande, ARC, OpenBookQA).
  • Inference-efficient design: LLaMA-13B runs on a single V100 GPU; LLaMA-65B fits on 2 A100 80GB GPUs.
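The single-GPU claims follow from weights-only memory arithmetic. A minimal sketch, assuming fp16 (2 bytes/parameter) and ignoring the KV cache and activations, which add real overhead in practice:

```python
# Weights-only fp16 memory footprint; KV cache and activations excluded.

def fp16_gib(n_params):
    return n_params * 2 / 2**30  # 2 bytes per parameter

print(fp16_gib(13e9))  # ~24 GiB -> within a 32 GB V100
print(fp16_gib(65e9))  # ~121 GiB -> needs two 80 GB A100s
```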

Methods

All LLaMA models use a decoder-only Transformer with three architectural modifications from the original design:

  • Pre-normalization: RMSNorm applied to the input of each sub-layer rather than the output, for training stability.
  • SwiGLU activation replacing ReLU in the FFN, following PaLM.
  • Rotary Positional Embeddings (RoPE) replacing absolute positional embeddings.

Context length is 2048 tokens. Training uses AdamW (β1=0.9, β2=0.95), a cosine LR schedule, weight decay 0.1, and gradient clipping at 1.0. The data pipeline applies aggressive deduplication and quality filtering to web-crawl data. Model depths: 7B (32 layers), 13B (40 layers), 33B (60 layers), 65B (80 layers).
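The three modifications are small enough to sketch directly. A minimal numpy version, assuming single-head shapes and the half-split channel-pairing convention for RoPE (implementations differ on whether pairs are interleaved or split):

```python
import numpy as np

def rms_norm(x, g, eps=1e-6):
    # RMSNorm: scale by root-mean-square only; no mean-centering
    # or bias term, unlike LayerNorm.
    return x / np.sqrt(np.mean(x**2, axis=-1, keepdims=True) + eps) * g

def swiglu_ffn(x, W, V, W2):
    # SwiGLU FFN: (SiLU(xW) * xV) @ W2, replacing ReLU(xW) @ W2.
    def silu(z):
        return z / (1 + np.exp(-z))
    return (silu(x @ W) * (x @ V)) @ W2

def rope(x, base=10000.0):
    # Rotary embeddings: rotate channel pairs by a position-dependent
    # angle instead of adding a learned/absolute position vector.
    seq, dim = x.shape
    half = dim // 2
    freqs = base ** (-np.arange(half) / half)   # per-pair frequencies
    angles = np.outer(np.arange(seq), freqs)    # (seq, dim/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin,
                           x1 * sin + x2 * cos], axis=-1)
```

Note that RoPE leaves position 0 unchanged (rotation angle zero) and encodes relative offsets in the dot products between rotated queries and keys, which is why it composes well with the attention mechanism.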

Failure modes

  • LLaMA base models are not instruction-following; they require fine-tuning (SFT + RLHF) for conversational use.
  • The 2048-token context window is shorter than GPT-3.5/4 and limits use on long documents.
  • Training data was collected before the February 2023 release; the knowledge cutoff makes the models weak on recent events.
  • LLaMA-7B is significantly weaker than LLaMA-65B on reasoning benchmarks (MATH, GSM8K) — smaller models do not inherit all capabilities.
  • Initial release was research-only; commercial use of the original LLaMA weights was restricted (changed with LLaMA 2).

Connections

Citation

arXiv:2302.13971

Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., Rodriguez, A., Joulin, A., Grave, E., & Lample, G. (2023). LLaMA: Open and Efficient Foundation Language Models. arXiv preprint. https://arxiv.org/abs/2302.13971