What It Is
Data quality refers to the properties of training examples that make them useful for learning the intended behavior: factual accuracy, format correctness, diversity of task types, and consistency of style. High-quality data teaches a model the right pattern; low-quality data teaches noise.
Why It Matters
Across alignment fine-tuning, pretraining data curation, and reinforcement learning from feedback, data quality consistently outweighs data quantity once a modest data scale is reached. LIMA (Zhou et al., 2023) is the clearest demonstration: a model fine-tuned on 1,000 carefully curated instruction-response pairs outperformed one trained on 52,000 GPT-3.5-generated examples (Alpaca) in human preference evaluations. Adding low-quality data to a high-quality dataset actively degrades performance: noise overwhelms signal.
How It Works
Quality in SFT data comes from several dimensions:
- Accuracy: the response must be factually correct and complete
- Format appropriateness: the response format should match the task (a step-by-step guide for procedural tasks, a structured explanation for conceptual ones)
- Diversity: examples should span different task types, domains, and response styles; homogeneous datasets hurt generalization
- Consistency: style inconsistencies within the dataset create conflicting gradient signals
In practice, this means human-authored or human-curated examples typically outperform machine-generated ones, even when the machine-generated set is orders of magnitude larger.
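Of the four dimensions above, format correctness and (exact-duplicate) diversity can be screened mechanically, while accuracy and style consistency generally require human or model review. A minimal sketch of such a pre-filter, with hypothetical function and threshold names not taken from any particular library:

```python
def quality_prefilter(examples, min_len=20):
    """Illustrative SFT data filter (hypothetical, not from any library).

    Checks two mechanically verifiable dimensions:
    - format: drop empty or suspiciously short responses
    - diversity: drop exact-duplicate responses

    Accuracy and style consistency still need human or model review.
    """
    seen = set()
    kept = []
    for ex in examples:
        response = ex["response"].strip()
        # Format check: a truncated or empty response teaches noise
        if len(response) < min_len:
            continue
        # Diversity check: exact duplicates add no signal and skew style
        key = response.lower()
        if key in seen:
            continue
        seen.add(key)
        kept.append(ex)
    return kept
```

Real curation pipelines layer stronger checks on top (near-duplicate detection, task-type balancing, and human or LLM-judge review for accuracy), but even a crude filter like this removes the examples most likely to inject conflicting gradient signals.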