What It Is
Data quality refers to the properties of training examples that make them useful for learning the intended behavior: factual accuracy, format correctness, diversity of task types, and consistency of style. High-quality data teaches a model the right pattern; low-quality data teaches noise.
Why It Matters
Across alignment fine-tuning, pretraining data curation, and reinforcement learning from feedback, data quality consistently outweighs data quantity once a modest data scale is reached. LIMA (Zhou et al., 2023) is the clearest demonstration: a model fine-tuned on 1,000 carefully curated instruction-response pairs outperformed one trained on 52,000 GPT-3.5-generated examples (Alpaca) in human preference evaluations. Adding low-quality data to a high-quality dataset actively degrades performance: noise overwhelms signal.
How It Works
Quality in SFT data comes from several dimensions:
- Accuracy: the response must be factually correct and complete
- Format appropriateness: the response format should match the task (a step-by-step guide for procedural tasks, a structured explanation for conceptual ones)
- Diversity: examples should span different task types, domains, and response styles; homogeneous datasets hurt generalization
- Consistency: style inconsistencies within the dataset create conflicting gradient signals
In practice, this means human-authored or human-curated examples typically outperform machine-generated ones, even when the machine-generated set is orders of magnitude larger.
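Of the four dimensions above, format correctness and (exact-duplicate) diversity can be screened mechanically, while accuracy and style consistency generally require human or model review. A minimal sketch of such a pre-filter, with hypothetical function and threshold names not taken from any particular library:

```python
def quality_prefilter(examples, min_len=20):
    """Illustrative SFT data filter (hypothetical, not from any library).

    Checks two mechanically verifiable dimensions:
    - format: drop empty or suspiciously short responses
    - diversity: drop exact-duplicate responses

    Accuracy and style consistency still need human or model review.
    """
    seen = set()
    kept = []
    for ex in examples:
        response = ex["response"].strip()
        # Format check: a truncated or empty response teaches noise
        if len(response) < min_len:
            continue
        # Diversity check: exact duplicates add no signal and skew style
        key = response.lower()
        if key in seen:
            continue
        seen.add(key)
        kept.append(ex)
    return kept
```

Real curation pipelines layer stronger checks on top (near-duplicate detection, task-type balancing, and human or LLM-judge review for accuracy), but even a crude filter like this removes the examples most likely to inject conflicting gradient signals.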