What It Is

Data quality refers to the properties of training examples that make them useful for learning the intended behavior: accuracy, format correctness, diversity of task types, and consistency of style. High-quality data teaches a model the right pattern; low-quality data teaches noise.

Why It Matters

Across alignment fine-tuning, pretraining data curation, and reinforcement learning from feedback, data quality consistently outweighs data quantity past a threshold. LIMA (Zhou et al., 2023) is the clearest demonstration: 1,000 carefully curated instruction-response pairs outperformed 52,000 GPT-3.5-generated examples (Alpaca) in human preference evaluations. Adding more low-quality data to a high-quality dataset actively degrades performance — noise overwhelms signal.

How It Works

Quality in SFT data comes from several dimensions:

  • Accuracy: the response must be factually correct and complete
  • Format appropriateness: the response format should match the task (a step-by-step guide for procedural tasks, a structured explanation for conceptual ones)
  • Diversity: examples should span different task types, domains, and response styles — homogeneous datasets hurt generalization
  • Consistency: style inconsistencies within the dataset create conflicting gradient signals

In practice, this means human-authored or human-curated examples outperform machine-generated ones, even when the machine-generated set is orders of magnitude larger.
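The dimensions above can be partially automated during curation. Below is a minimal sketch of a heuristic quality filter; the function name, field names (`instruction`, `response`, `task_type`), and word-count threshold are illustrative assumptions, not a standard API. Real pipelines add factuality review, format validation, and human inspection on top of checks like these.

```python
from collections import Counter

def filter_sft_examples(examples, min_response_words=5):
    """Heuristic quality filter for instruction-response pairs.

    Approximates three of the quality dimensions:
    completeness (minimum response length), consistency
    (exact-duplicate removal), and diversity (a task-type
    distribution report so skew is visible to the curator).
    """
    seen = set()
    kept = []
    for ex in examples:
        response = ex["response"].strip()
        # Completeness: drop trivially short responses
        if len(response.split()) < min_response_words:
            continue
        # Consistency: drop exact duplicates, which add
        # redundant (or conflicting) gradient signal
        key = (ex["instruction"].strip().lower(), response.lower())
        if key in seen:
            continue
        seen.add(key)
        kept.append(ex)
    # Diversity: report how examples spread across task types
    distribution = Counter(ex.get("task_type", "unknown") for ex in kept)
    return kept, distribution
```

Accuracy is deliberately absent from the sketch: factual correctness cannot be checked by surface heuristics and is exactly where human curation earns its cost.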

Key Sources