Word Embeddings

The Problem

Pre-2013 NLP represented words as one-hot vectors — a 50,000-dim vector with a single 1, the rest zeros. This is fine for indexing but useless for similarity: “cat” and “kitten” have a cosine similarity of zero, the same as “cat” and “kerosene.” Worse, learning anything from one-hots required separately learning relationships for every word pair — an N² problem. Pre-trained dense word representations existed (LSA, latent semantic analysis) but were either too slow to train at scale or didn’t capture rich semantic relationships. The problem: how to assign every word a low-dimensional dense vector that captures its meaning, learnable from raw text alone.
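
The zero-similarity claim is easy to check by hand. The following is a minimal sketch using a hypothetical four-word vocabulary (the words and their indices are illustrative assumptions, not taken from any real dataset):

```python
import math

def one_hot(index, size):
    """Return a list of `size` zeros with a single 1.0 at `index`."""
    vec = [0.0] * size
    vec[index] = 1.0
    return vec

def cosine(u, v):
    """Cosine similarity between two equal-length dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy vocabulary; the index assignment is arbitrary.
vocab = {"cat": 0, "kitten": 1, "kerosene": 2, "dog": 3}
vectors = {word: one_hot(idx, len(vocab)) for word, idx in vocab.items()}

sim_cat_kitten = cosine(vectors["cat"], vectors["kitten"])
sim_cat_kerosene = cosine(vectors["cat"], vectors["kerosene"])
sim_cat_cat = cosine(vectors["cat"], vectors["cat"])
print(sim_cat_kitten, sim_cat_kerosene, sim_cat_cat)
```

Any two distinct one-hot vectors are orthogonal, so every pair of different words looks equally unrelated regardless of meaning.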

The Key Insight

You shall know a word by the company it keeps (Firth, 1957). Train a model whose objective is to predict surrounding words from a center word (or vice versa). The model’s hidden representations of words — what it’s forced to compute to do this prediction — will end up encoding the words’ co-occurrence patterns. Words that appear in similar contexts (cat, kitten, dog) will get similar vectors. This is the distributional hypothesis, made operational.
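
A count-based illustration of the same idea, before any neural model is involved: represent each word by the counts of its neighbors and compare those count vectors. The five-sentence corpus below is a made-up assumption chosen so the effect is visible; real corpora are far noisier.

```python
import math
from collections import Counter, defaultdict

def context_counts(sentences, window=2):
    """Map each word to a Counter of the words seen within `window` positions of it."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.split()
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[word][tokens[j]] += 1
    return counts

def cosine(a, b):
    """Cosine similarity between two sparse count vectors stored as Counters."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical toy corpus.
corpus = [
    "the cat chased the mouse",
    "the kitten chased the mouse",
    "the cat drank milk",
    "the kitten drank milk",
    "the lamp burned kerosene",
]
counts = context_counts(corpus, window=2)
sim_cat_kitten = cosine(counts["cat"], counts["kitten"])
sim_cat_kerosene = cosine(counts["cat"], counts["kerosene"])
print(sim_cat_kitten, sim_cat_kerosene)
```

In this toy corpus "cat" and "kitten" occur in identical contexts while "kerosene" shares none of them within the window, which is the signal a learned embedding compresses into a dense vector.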

Mechanism in Plain English

Tokenize a large text corpus into a sequence of word IDs.
Define a prediction task: given a center word, predict the surrounding words within a window (Skip-gram); or given the surrounding words, predict the center (CBOW).
Build a model: two embedding tables, $W_{in}$ (input/center) and $W_{out}$ (output/context). Each table has one row per vocabulary word, and each row is a $d$-dimensional vector.
For a training example (center, context), the score is $W_{in}[\text{center}] \cdot W_{out}[\text{context}]$. Train to maximize this score for true pairs and minimize it for randomly sampled negatives (negative sampling).
After training, $W_{in}$ is the matrix of word embeddings.
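
The first two steps (tokenize, then form prediction pairs) can be sketched as follows. This is an illustrative sketch, not the original word2vec implementation: the function name `skipgram_pairs`, the whitespace tokenizer, and the fixed symmetric window of 2 are all assumptions made for the example.

```python
def skipgram_pairs(token_ids, window=2):
    """Return a list of (center, context) ID pairs for every position in the sequence."""
    pairs = []
    for i, center in enumerate(token_ids):
        lo = max(0, i - window)
        hi = min(len(token_ids), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, token_ids[j]))
    return pairs

# Tokenize by whitespace and assign IDs in order of first appearance.
sentence = "the quick brown fox jumps over".split()
vocab = {word: idx for idx, word in enumerate(dict.fromkeys(sentence))}
token_ids = [vocab[w] for w in sentence]

pairs = skipgram_pairs(token_ids, window=2)
fox_pairs = [p for p in pairs if p[0] == vocab["fox"]]
print(fox_pairs)  # the four (center, context) pairs whose center is "fox"
```

For CBOW the same windows are used, but the context IDs of one position are grouped together as the input and the center ID becomes the target.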

ASCII Diagram

SKIP-GRAM (Word2Vec):

  Sentence: "the quick brown FOX jumps over"
                              |
                       center word (FOX)
                              |
                       look up W_in[FOX]
                              |
                              v
                       300-d vector
                              |
                              v
                  for each context word w in window:
                      score = vector @ W_out[w]
                      softmax over vocabulary
                      train to make true context words high


  After training, W_in rows are the word embeddings.
  Words with similar contexts get similar vectors.

Math with Translation

Skip-gram with negative sampling:

$L = -\log \sigma(W_{in}[c] \cdot W_{out}[o]) - \sum_{k=1}^{K} \mathbb{E}_{w_k \sim P_n} \log \sigma(-W_{in}[c] \cdot W_{out}[w_k])$

$c$ = center word, $o$ = observed (true) context word.
$\sigma$ = sigmoid.
The first term: maximize the score for the true (center, context) pair.
The second term: minimize the score for $K$ random negatives $w_k$ (typically $K$ between 5 and 15), sampled from the noise distribution $P_n$: the unigram distribution raised to the 3/4 power and renormalized, which slightly upweights rare words.
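
To make the effect of the 3/4 power concrete, the sketch below builds the noise distribution from word counts and compares it with the plain unigram distribution. The counts are hypothetical, and the helper names `noise_distribution` and `sample_negatives` are introduced here for illustration only. For simplicity the sampler does not exclude the true context word.

```python
import random

def noise_distribution(counts, power=0.75):
    """Raise each count to `power` and renormalize into a probability distribution."""
    weights = {w: c ** power for w, c in counts.items()}
    total = sum(weights.values())
    return {w: v / total for w, v in weights.items()}

def sample_negatives(dist, k, rng):
    """Draw k negative words (with replacement) according to `dist`."""
    words = list(dist)
    weights = [dist[w] for w in words]
    return rng.choices(words, weights=weights, k=k)

# Hypothetical corpus counts: one very frequent, one medium, one rare word.
counts = {"the": 1000, "cat": 100, "kerosene": 1}
unigram = noise_distribution(counts, power=1.0)
smoothed = noise_distribution(counts, power=0.75)

negatives = sample_negatives(smoothed, k=5, rng=random.Random(0))
print(unigram, smoothed, negatives)
```

With these counts the rare word's probability rises and the most frequent word's probability falls relative to the raw unigram distribution, which is the "slightly upweights rare words" effect.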

The famous geometric property:

$W_{in}[\text{king}] - W_{in}[\text{man}] + W_{in}[\text{woman}] \approx W_{in}[\text{queen}]$

The model is never explicitly trained to do this. However, the offset from “king” to “queen” ends up roughly parallel to the offset from “man” to “woman” (both contrasts mostly encode gender), so subtracting “man” and adding “woman” lands you near “queen.” The relation is approximate, not exact. This emergent linear structure is what made word2vec famous.
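
The arithmetic can be illustrated with hand-built vectors. Everything in this sketch is an assumption made for the example: the 3-d vectors are constructed by hand (axes loosely meaning royalty, gender, and a distractor), not learned, so the analogy holds exactly here while learned embeddings only satisfy it approximately. Excluding the three query words from the candidates follows common practice in analogy evaluation.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def analogy(vectors, a, b, c):
    """Return the word nearest (by cosine) to vec[a] - vec[b] + vec[c], excluding a, b, c."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

# Hand-constructed toy vectors, not trained embeddings.
vectors = {
    "king":     [1.0, 1.0, 0.0],
    "queen":    [1.0, -1.0, 0.0],
    "man":      [0.0, 1.0, 0.0],
    "woman":    [0.0, -1.0, 0.0],
    "prince":   [1.0, 1.0, -1.0],
    "kerosene": [-1.0, 0.0, 0.5],
}
result = analogy(vectors, "king", "man", "woman")
print(result)
```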

Concrete Walkthrough

SETUP: vocabulary = 50,000 words; embedding dim = 300.
       Corpus: 1.6 billion words from Google News.

TRAINING:
  For each window of 5 words: 
    Take the center word; pair it with each of 4 context words.
    For each pair: 1 positive + 5 negative samples.
    Update W_in[center] and W_out[positive], W_out[negatives] via gradient descent.
  
  Total pairs: ~1.6B * 4 = 6.4B (center, context) pairs.
  Total per-sample updates: ~6.4B * 6 = ~38B (1 positive + 5 negatives per pair).
  
TIME: ~1 day on a single multi-core CPU.

OUTPUT: W_in is 50,000 x 300 = 15M parameters. The embeddings.

EXAMPLE LOOKUPS (cosine similarity):
  most_similar(W_in[king]):       queen, prince, monarch, kingdom, throne
  most_similar(W_in[paris]):      france, london, capital, european, city
  most_similar(W_in[programming]): coding, software, code, programming-language

ANALOGY:
  vec_paris - vec_france + vec_germany = ?
  Search nearest neighbors: top result = vec_berlin
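
The counts in the walkthrough follow from a few multiplications. The sketch below reproduces them; the variable names are chosen for this example, and the float32 storage size is an assumption that the walkthrough does not state.

```python
corpus_words = 1_600_000_000   # 1.6B words
window = 5                     # words per window, center included
negatives = 5                  # negative samples per positive pair
vocab_size = 50_000
dim = 300

contexts_per_center = window - 1                       # 4 context words
context_pairs = corpus_words * contexts_per_center     # (center, context) pairs
sample_updates = context_pairs * (1 + negatives)       # positives plus negatives
params_per_table = vocab_size * dim                    # size of W_in (and of W_out)
training_params = 2 * params_per_table                 # both tables are trained

# Assumption: 4 bytes per parameter (float32).
w_in_megabytes = params_per_table * 4 / 1_000_000
print(context_pairs, sample_updates, params_per_table, training_params, w_in_megabytes)
```

Only W_in is kept as the embedding matrix, but both tables are updated during training, so the trained parameter count is twice the 15M reported for the output.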

What’s Clever

The first clever recognition: the hidden layer in earlier neural language models was unnecessary for the embedding objective. Bengio’s 2003 NNLM had a tanh hidden layer; the bulk of compute went there. Word2Vec removes it entirely — just two embedding tables and a dot product. Surprisingly, the resulting embeddings are better, not worse, because the simpler model trains on more data faster.

The second clever move: negative sampling as a replacement for the softmax. Computing a full softmax over a 50K vocabulary at every training step is too slow. Negative sampling replaces it with a binary classification per example: true pairs vs. randomly sampled negatives. Under certain assumptions, this objective implicitly factorizes a word-context matrix of shifted PMI values (PMI minus $\log K$; Levy & Goldberg, 2014), and in practice it produces vectors with similar geometry to the full-softmax version at a fraction of the cost.
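
The cost difference can be made concrete by counting dot products for a single training pair. This is a rough sketch under stated assumptions: a small random model (V = 1000, d = 8), negatives drawn uniformly for brevity instead of from the 3/4-power unigram distribution, and helper names invented for the example.

```python
import math
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    return -math.log1p(math.exp(-x)) if x >= 0 else x - math.log1p(math.exp(x))

def full_softmax_loss(center_vec, W_out, context_id):
    """Cross-entropy over the whole vocabulary: one dot product per output row."""
    scores = [dot(center_vec, row) for row in W_out]
    max_score = max(scores)
    log_norm = max_score + math.log(sum(math.exp(s - max_score) for s in scores))
    return log_norm - scores[context_id], len(scores)

def negative_sampling_loss(center_vec, W_out, context_id, negative_ids):
    """Binary logistic loss on 1 positive and K negatives: K + 1 dot products."""
    loss = -log_sigmoid(dot(center_vec, W_out[context_id]))
    for neg_id in negative_ids:
        loss -= log_sigmoid(-dot(center_vec, W_out[neg_id]))
    return loss, 1 + len(negative_ids)

rng = random.Random(0)
V, d, K = 1000, 8, 5
W_in = [[rng.gauss(0, 0.1) for _ in range(d)] for _ in range(V)]
W_out = [[rng.gauss(0, 0.1) for _ in range(d)] for _ in range(V)]

center_id, context_id = 3, 7
negative_ids = [rng.randrange(V) for _ in range(K)]  # uniform sampling, for brevity

softmax_loss, softmax_dots = full_softmax_loss(W_in[center_id], W_out, context_id)
ns_loss, ns_dots = negative_sampling_loss(W_in[center_id], W_out, context_id, negative_ids)
print(softmax_dots, ns_dots)
```

The full softmax touches every row of the output table per example, while negative sampling touches K + 1 rows, independent of vocabulary size.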

The third (most foundational) recognition: distributional semantics actually works at scale. The Firth hypothesis, that meaning is captured by co-occurrence, was a roughly 60-year-old linguistic conjecture that nobody had operationalized convincingly. Word2Vec made it concrete: a billion-word corpus + a trivial neural net + about a day of training = vectors with rich semantic structure. This unlocked the entire next decade of representation learning.

Code

# Conceptual minimal skip-gram with negative sampling
 
import torch
import torch.nn as nn
import torch.nn.functional as F
 
class SkipGram(nn.Module):
    def __init__(self, vocab_size, dim):
        super().__init__()
        self.W_in  = nn.Embedding(vocab_size, dim)  # center embeddings
        self.W_out = nn.Embedding(vocab_size, dim)  # context embeddings
 
    def forward(self, center_ids, pos_ids, neg_ids):
        # center_ids: (batch,)   pos_ids: (batch,)   neg_ids: (batch, K)
        c = self.W_in(center_ids)               # (batch, dim)
        p = self.W_out(pos_ids)                 # (batch, dim)
        n = self.W_out(neg_ids)                 # (batch, K, dim)

        pos_score = (c * p).sum(-1)               # (batch,)
        neg_score = (c.unsqueeze(1) * n).sum(-1)  # (batch, K)

        # Sum over the K negatives (as in the loss formula), then average over the batch.
        pos_loss = -F.logsigmoid(pos_score)           # (batch,)
        neg_loss = -F.logsigmoid(-neg_score).sum(-1)  # (batch,)
        return (pos_loss + neg_loss).mean()
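
To see what one gradient step on this loss does without a framework, the sketch below applies the update by hand to a single (center, positive, negatives) example. It is an illustration in plain Python, not the optimized word2vec update; the vectors, the learning rate, and the function names are assumptions. The gradients used are $\partial L / \partial s = \sigma(s) - 1$ for the positive score and $\sigma(s)$ for each negative score.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def sgns_loss(c, p, negs):
    """Negative-sampling loss for one center vector, one positive, and K negatives."""
    loss = -math.log(sigmoid(dot(c, p)))
    for n in negs:
        loss -= math.log(sigmoid(-dot(c, n)))
    return loss

def sgns_step(c, p, negs, lr=0.1):
    """Apply one SGD update in place to c, p, and each negative vector."""
    g_pos = sigmoid(dot(c, p)) - 1.0               # dL/d(c.p)
    g_negs = [sigmoid(dot(c, n)) for n in negs]    # dL/d(c.n_k)
    # Gradient w.r.t. the center vector, computed before any vector is modified.
    grad_c = [
        g_pos * p[j] + sum(g * n[j] for g, n in zip(g_negs, negs))
        for j in range(len(c))
    ]
    for j in range(len(c)):
        p[j] -= lr * g_pos * c[j]
    for g, n in zip(g_negs, negs):
        for j in range(len(c)):
            n[j] -= lr * g * c[j]
    for j in range(len(c)):
        c[j] -= lr * grad_c[j]

# Hypothetical 3-d vectors for one training example with K = 2 negatives.
c = [0.1, -0.2, 0.3]
p = [0.2, 0.1, -0.1]
negs = [[0.3, 0.2, 0.1], [-0.1, 0.4, 0.2]]

loss_before = sgns_loss(c, p, negs)
sgns_step(c, p, negs, lr=0.1)
loss_after = sgns_loss(c, p, negs)
print(loss_before, loss_after)
```

The step pulls the center and positive vectors toward each other and pushes the center away from the negatives, so the loss on this example decreases.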

Key Sources

word2vec-efficient-estimation-word-representations — the foundational paper
bert-pre-training-of-deep-bidirectional-transformers-for-language-understanding — BERT contextualizes the static word embeddings
sentence-bert-siamese-bert-networks — SBERT extends the recipe to whole sentences
bge-c-pack-general-chinese-embeddings
colbert-late-interaction-retrieval
mteb-massive-text-embedding-benchmark

Related Concepts

distributional-hypothesis — the linguistic foundation
negative-sampling — the key training trick
self-supervised-learning — predict-the-context is the canonical SSL pattern
sentence-embeddings — the sentence-level extension
tokenization — affects vocabulary; modern systems use BPE instead of word-level

Open Questions

Polysemy: word2vec gives one vector per word, so the senses of “bank” (river bank, financial bank) collapse into a single vector. ELMo and BERT address this with contextual embeddings, but at much higher inference cost.
Multilingual alignment: how can embeddings be learned so that translations of the same word in two languages map to similar vectors? Approaches have progressed from bilingual word embeddings to multilingual BERT to cross-lingual contrastive learning.
Subword: FastText extends word2vec to handle morphologically rich languages and OOV words via character n-grams. BPE and similar subword tokenizers are the common modern replacement and apply across languages.
Bias: word embeddings absorb whatever biases are in their training corpus (man:doctor :: woman:nurse). Mitigation is an active research area.
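
For the subword item above, the character n-gram idea can be sketched as follows. This is a simplified illustration in the spirit of FastText, not its actual implementation: the function name `char_ngrams` is an assumption, and the hashing of n-grams into buckets and the summing of n-gram vectors into a word vector are omitted.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the character n-grams of `word` wrapped in boundary markers < and >."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    return grams

trigrams = char_ngrams("where", min_n=3, max_n=3)

# An out-of-vocabulary word still shares n-grams with known words.
shared = set(char_ngrams("where")) & set(char_ngrams("wherever"))
print(trigrams, sorted(shared))
```

Because an unseen word shares n-grams with words seen in training, a vector can still be composed for it from the n-gram vectors.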
