What It Is

A self-supervised learning objective where a model is trained to reconstruct clean data from a corrupted version of it. The corruption (noise) is applied artificially during training; the model learns to reverse it.

Why It Matters

Denoising turns unlabeled data into a supervised learning problem: the corrupted input becomes the training example and the original text becomes the label. This allows pre-training on massive text corpora without any human annotation. To denoise well, the model must learn meaningful representations — it can’t just memorize surface patterns.
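A minimal sketch of this idea in Python (the function name, mask token, and parameters here are illustrative, not from any particular library): random token masking turns raw text into (corrupted, original) pairs with no labels required.

```python
import random

def make_denoising_pair(tokens, mask_token="[MASK]", p=0.3, seed=1):
    """Corrupt a token sequence by randomly masking tokens.
    The corrupted sequence is the model input; the original is the target,
    so unlabeled text yields supervised (input, target) pairs for free."""
    rng = random.Random(seed)
    corrupted = [mask_token if rng.random() < p else t for t in tokens]
    return corrupted, tokens

corrupted, target = make_denoising_pair("the cat sat on the mat".split())
# `corrupted` has some tokens replaced by [MASK]; `target` is untouched.
```

In practice the masking rate and mask format vary by model family, but the pattern is the same: the noise is applied on the fly, so every pass over the corpus can produce a fresh corruption of the same text.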

How It Works

During pre-training, a noising function corrupts an input (masking tokens, deleting words, shuffling sentences, replacing spans), and the model is trained to reconstruct the original. Different noise types force the model to learn different skills: span masking forces it to fill multi-token gaps; sentence permutation forces it to model global document coherence. The choice of noise function directly determines what the model learns.
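The noise types listed above can be sketched as simple token-level transforms. This is an illustrative toy version (function names and the sentinel token format are assumptions, loosely modeled on T5/BART-style corruption, not any library's actual API):

```python
import random

def mask_span(tokens, start, length, sentinel="<extra_id_0>"):
    """Span masking: collapse a contiguous span into one sentinel token,
    so the model must regenerate a multi-token gap from a single placeholder."""
    return tokens[:start] + [sentinel] + tokens[start + length:]

def delete_tokens(tokens, idxs):
    """Deletion: drop tokens entirely, so the model must also infer *where*
    content is missing, not just what it was."""
    drop = set(idxs)
    return [t for i, t in enumerate(tokens) if i not in drop]

def permute_sentences(sentences, seed=0):
    """Sentence permutation: shuffle sentence order, so reconstruction
    requires modeling document-level coherence, not local context."""
    rng = random.Random(seed)
    shuffled = sentences[:]
    rng.shuffle(shuffled)
    return shuffled

tokens = "the quick brown fox jumps".split()
masked = mask_span(tokens, 1, 3)       # ['the', '<extra_id_0>', 'jumps']
deleted = delete_tokens(tokens, [1, 3])  # ['the', 'brown', 'jumps']
```

Note how the three corruptions leave the model with qualitatively different reconstruction problems even on the same input — which is why the choice of noise function shapes what the model learns.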

Key Sources