The Problem
Pre-2013 NLP represented words as one-hot vectors — a 50,000-dim vector with a single 1, the rest zeros. This is fine for indexing but useless for similarity: “cat” and “kitten” have a cosine similarity of zero, the same as “cat” and “kerosene.” Worse, learning anything from one-hots required separately learning relationships for every word pair — an N² problem. Pre-trained dense word representations existed (LSA, latent semantic analysis) but were either too slow to train at scale or didn’t capture rich semantic relationships. The problem: how to assign every word a low-dimensional dense vector that captures its meaning, learnable from raw text alone.
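The similarity claim is easy to verify by hand. The sketch below is a minimal illustration that uses a hypothetical three-word vocabulary instead of 50,000 words; the helper names `one_hot` and `cosine` are made up for this example.

```python
import math

def one_hot(index, size):
    """Return a list of zeros with a single 1.0 at `index`."""
    vec = [0.0] * size
    vec[index] = 1.0
    return vec

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy vocabulary; the index assignment is arbitrary.
vocab = {"cat": 0, "kitten": 1, "kerosene": 2}
vectors = {word: one_hot(idx, len(vocab)) for word, idx in vocab.items()}

sim_cat_kitten = cosine(vectors["cat"], vectors["kitten"])
sim_cat_kerosene = cosine(vectors["cat"], vectors["kerosene"])
sim_cat_cat = cosine(vectors["cat"], vectors["cat"])
print(sim_cat_kitten, sim_cat_kerosene, sim_cat_cat)  # 0.0 0.0 1.0
```

Any two distinct one-hot vectors are orthogonal, so no choice of indices can make “cat” closer to “kitten” than to “kerosene.”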
The Key Insight
You shall know a word by the company it keeps (Firth, 1957). Train a model whose objective is to predict surrounding words from a center word (or vice versa). The model’s hidden representations of words — what it’s forced to compute to do this prediction — will end up encoding the words’ co-occurrence patterns. Words that appear in similar contexts (cat, kitten, dog) will get similar vectors. This is the distributional hypothesis, made operational.
Mechanism in Plain English
- Tokenize a large text corpus into a sequence of word IDs.
- Define a prediction task: given a center word, predict the surrounding words within a window (Skip-gram); or given the surrounding words, predict the center (CBOW).
- Build a model: two embedding tables, $W_{in}$ (input/center) and $W_{out}$ (output/context). Each table has one row per vocabulary word, and each row is a $d$-dimensional vector.
- For a training example (center, context), the score is the dot product $W_{in}[\text{center}] \cdot W_{out}[\text{context}]$. Train to maximize this for true pairs and minimize it for negatives (negative sampling).
- After training, $W_{in}$ is the matrix of word embeddings.
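The first two steps above (tokenize, then form Skip-gram training pairs) can be sketched in a few lines. This is a minimal illustration, not the reference implementation: the function name `skipgram_pairs` is hypothetical, and it assumes a fixed window of 2 words on each side, whereas real implementations often vary the window size randomly.

```python
def skipgram_pairs(token_ids, window):
    """Return (center, context) ID pairs for every position in the sequence."""
    pairs = []
    for i, center in enumerate(token_ids):
        lo = max(0, i - window)
        hi = min(len(token_ids), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((center, token_ids[j]))
    return pairs

sentence = "the quick brown fox jumps over".split()
# Assign IDs in order of first appearance (dict.fromkeys keeps order).
vocab = {word: idx for idx, word in enumerate(dict.fromkeys(sentence))}
token_ids = [vocab[word] for word in sentence]

pairs = skipgram_pairs(token_ids, window=2)
id_to_word = {idx: word for word, idx in vocab.items()}
fox_pairs = [(id_to_word[c], id_to_word[o]) for c, o in pairs
             if id_to_word[c] == "fox"]
print(fox_pairs)
# [('fox', 'quick'), ('fox', 'brown'), ('fox', 'jumps'), ('fox', 'over')]
```

CBOW uses the same windows but groups them the other way round: the four context IDs form the input and the center ID is the prediction target.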
ASCII Diagram
    SKIP-GRAM (Word2Vec):

    Sentence: "the quick brown FOX jumps over"
                                    |
                            center word (FOX)
                                    |
                            look up W_in[FOX]
                                    |
                                    v
                              300-d vector
                                    |
                                    v
                  for each context word w in window:
                      score = vector @ W_out[w]
                      softmax over vocabulary
                      train to make true context words high

    After training, W_in rows are the word embeddings.
    Words with similar contexts get similar vectors.
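The forward pass in the diagram can be traced with a toy example. The sketch below assumes a six-word vocabulary, 4-dimensional vectors instead of 300, and randomly initialized (untrained) tables, so the probabilities it produces are meaningless; it only shows the sequence lookup, dot product, softmax, loss.

```python
import math
import random

random.seed(0)
VOCAB = ["the", "quick", "brown", "fox", "jumps", "over"]
DIM = 4  # toy size; the diagram uses 300

# Randomly initialized stand-ins for W_in and W_out (no training done here).
W_in = [[random.uniform(-0.5, 0.5) for _ in range(DIM)] for _ in VOCAB]
W_out = [[random.uniform(-0.5, 0.5) for _ in range(DIM)] for _ in VOCAB]

def softmax(scores):
    """Convert raw scores to probabilities (max is subtracted for stability)."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def context_distribution(center_id):
    """Score the center vector against every row of W_out, then normalize."""
    vector = W_in[center_id]
    scores = [sum(v * u for v, u in zip(vector, row)) for row in W_out]
    return softmax(scores)

probs = context_distribution(VOCAB.index("fox"))
# Cross-entropy loss for one true context word, "jumps".
loss = -math.log(probs[VOCAB.index("jumps")])
print(len(probs), round(sum(probs), 6))  # 6 1.0
```

Note that every call scores the center word against the whole vocabulary, which is the cost that negative sampling avoids.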
Math with Translation
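Skip-gram with negative sampling maximizes, for each observed pair $(c, o)$:

$$J(c, o) = \log \sigma\left(u_o^\top v_c\right) + \sum_{k=1}^{K} \log \sigma\left(-u_{n_k}^\top v_c\right)$$

- $c$ = center word, $o$ = observed (true) context word; $v_c = W_{in}[c]$ and $u_o = W_{out}[o]$ are their rows in the two embedding tables.
- $\sigma$ = sigmoid.
- The first term: maximize the score for the true (center, context) pair.
- The second term: minimize the score for $K$ random negatives $n_1, \dots, n_K$ (typically $K$ is between 5 and 20; fewer suffice on very large corpora), where negatives are sampled from a unigram distribution raised to the 3/4 power (slightly upweights rare words).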
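The effect of the 3/4 power on the negative-sampling distribution can be checked numerically. The sketch below uses hypothetical word counts; the function names are made up for this example, and excluding the true context word from the candidates is a simplifying assumption.

```python
import random

def noise_distribution(counts, power=0.75):
    """Unigram counts raised to `power`, renormalized to sum to 1."""
    weights = {word: count ** power for word, count in counts.items()}
    total = sum(weights.values())
    return {word: weight / total for word, weight in weights.items()}

def sample_negatives(noise, k, exclude, rng):
    """Draw k negatives from the noise distribution, skipping excluded words."""
    words = [w for w in noise if w not in exclude]
    weights = [noise[w] for w in words]
    return rng.choices(words, weights=weights, k=k)

# Hypothetical counts: one very frequent word, one mid, one rare.
counts = {"the": 1000, "fox": 10, "kerosene": 1}
total = sum(counts.values())
unigram = {word: count / total for word, count in counts.items()}
noise = noise_distribution(counts)

for word in counts:
    print(f"{word:10s} unigram={unigram[word]:.4f} noise={noise[word]:.4f}")

rng = random.Random(0)
negatives = sample_negatives(noise, k=5, exclude={"fox"}, rng=rng)
```

With these counts, the probability of “kerosene” rises from about 0.001 to about 0.005, while “the” drops slightly, which is the upweighting of rare words mentioned above.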
The famous geometric property:

$$\text{vec}(\text{king}) - \text{vec}(\text{man}) + \text{vec}(\text{woman}) \approx \text{vec}(\text{queen})$$

The model is never explicitly trained to do this. But the direction in which “king” differs from “queen” is roughly parallel to the direction in which “man” differs from “woman” (both contrasts encode gender), so subtraction-and-addition lands you near “queen.” This emergent linear structure is what made word2vec famous.
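The arithmetic can be illustrated with hand-built vectors. The sketch below is not trained word2vec output: the three dimensions (royalty, gender, fruit) and all values are assumptions chosen so that the gender direction is shared, and the helper `analogy` is hypothetical. As is common practice, the three query words are excluded from the candidates.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))

def analogy(a, b, c, vectors):
    """Return the word nearest to vec(a) - vec(b) + vec(c), excluding a, b, c."""
    target = [x - y + z
              for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

# Hand-built toy vectors: (royalty, gender, fruit). Purely illustrative.
vectors = {
    "king":   [1.0,  1.0, 0.0],
    "queen":  [1.0, -1.0, 0.0],
    "man":    [0.0,  1.0, 0.0],
    "woman":  [0.0, -1.0, 0.0],
    "prince": [0.9,  1.0, 0.0],
    "apple":  [0.0,  0.0, 1.0],
}

result = analogy("king", "man", "woman", vectors)
print(result)  # queen
```

In trained embeddings the offsets are only approximately parallel, so the result is the nearest neighbor of the target point and not an exact match.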
Concrete Walkthrough
    SETUP:
        vocabulary = 50,000 words; embedding dim = 300.
        Corpus: 1.6 billion words from Google News.

    TRAINING:
        For each window of 5 words:
            Take the center word; pair it with each of the 4 context words.
            For each pair: 1 positive + 5 negative samples.
            Update W_in[center], W_out[positive], and W_out[negatives] via gradient descent.
        Total pairs: ~1.6B * 4 = 6.4B context pairs.
        Total scored pairs: ~6.4B * 6 = ~38B (1 positive + 5 negatives each).

    TIME: ~1 day on a single multi-core CPU.

    OUTPUT: W_in is 50,000 x 300 = 15M parameters. These are the embeddings.

    EXAMPLE LOOKUPS (cosine similarity):
        most_similar(W_in[king]):        queen, prince, monarch, kingdom, throne
        most_similar(W_in[paris]):       france, london, capital, european, city
        most_similar(W_in[programming]): coding, software, code, programming-language

    ANALOGY:
        vec_paris - vec_france + vec_germany = ?
        Search nearest neighbors: top result = vec_berlin
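The counts in the walkthrough follow from a few multiplications. The sketch below simply reproduces that arithmetic from the stated setup; the variable names are chosen for this example.

```python
corpus_words = 1_600_000_000   # 1.6B words
window = 5                     # words per window, including the center
negatives = 5                  # negative samples per positive pair
vocab_size = 50_000
dim = 300

context_per_center = window - 1                      # 4 context words
context_pairs = corpus_words * context_per_center    # positive pairs
scored_pairs = context_pairs * (1 + negatives)       # positives + negatives
params_per_table = vocab_size * dim                  # size of W_in alone
total_trained_params = 2 * params_per_table          # W_in and W_out

print(f"{context_pairs:,}")         # 6,400,000,000
print(f"{scored_pairs:,}")          # 38,400,000,000
print(f"{params_per_table:,}")      # 15,000,000
print(f"{total_trained_params:,}")  # 30,000,000
```

Training updates both tables, so 30M parameters are learned in total, but only the 15M in W_in are kept as the embeddings.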
What’s Clever
The first clever recognition: the hidden layer in earlier neural language models was unnecessary for the embedding objective. Bengio’s 2003 NNLM had a tanh hidden layer; the bulk of compute went there. Word2Vec removes it entirely — just two embedding tables and a dot product. Surprisingly, the resulting embeddings are better, not worse, because the simpler model trains on more data faster.
The second clever move: negative sampling as a softmax approximation. Computing a full softmax over a 50K vocabulary at every training step is too slow. Negative sampling replaces it with a handful of binary classifications per example: the true (center, context) pair versus $K$ randomly sampled negatives. Under certain assumptions, this is equivalent to implicitly factorizing a word-context matrix of pointwise mutual information (PMI) values shifted by $\log K$ (Levy and Goldberg, 2014), and in practice it yields embeddings of comparable quality to softmax-based training at a fraction of the cost.
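The per-example cost difference is visible in a direct implementation of the negative-sampling loss. The sketch below uses hand-picked 2-d vectors as an assumption for illustration; `sgns_loss` is a hypothetical helper that sums over the negatives.

```python
import math

def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def sgns_loss(center_vec, pos_vec, neg_vecs):
    """Loss for one (center, context) pair: needs only K + 1 dot products."""
    loss = -log_sigmoid(dot(center_vec, pos_vec))
    for neg_vec in neg_vecs:
        loss -= log_sigmoid(-dot(center_vec, neg_vec))
    return loss

# Hand-picked toy vectors, purely illustrative.
center = [1.0, 0.0]
negatives = [[-2.0, 0.0], [0.0, 1.0]]

loss_good = sgns_loss(center, [2.0, 0.0], negatives)   # context aligned with center
loss_bad = sgns_loss(center, [-2.0, 0.0], negatives)   # context opposed to center

# Dot products per training pair with the walkthrough's numbers.
vocab_size, k = 50_000, 5
dot_products_softmax = vocab_size   # one score per vocabulary word
dot_products_sgns = 1 + k           # one positive plus K negatives
print(loss_good < loss_bad, dot_products_softmax, dot_products_sgns)
# True 50000 6
```

A context vector aligned with the center vector gives the lower loss, and each pair touches 6 rows of the output table where a full softmax would touch all 50,000.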
The third (most foundational) recognition: distributional semantics actually works at scale. The Firth hypothesis, that meaning is captured by co-occurrence, was a roughly 60-year-old linguistic idea that earlier methods such as LSA had operationalized only partially. Word2Vec made it concrete: a billion-word corpus + a trivial neural net + about a day of CPU training = vectors with rich semantic structure. This unlocked the entire next decade of representation learning.
Code
    # Conceptual minimal skip-gram with negative sampling
    import torch
    import torch.nn as nn
    import torch.nn.functional as F


    class SkipGram(nn.Module):
        def __init__(self, vocab_size, dim):
            super().__init__()
            self.W_in = nn.Embedding(vocab_size, dim)   # center embeddings
            self.W_out = nn.Embedding(vocab_size, dim)  # context embeddings

        def forward(self, center_ids, pos_ids, neg_ids):
            # center_ids: (batch,)  pos_ids: (batch,)  neg_ids: (batch, K)
            c = self.W_in(center_ids)                  # (batch, dim)
            p = self.W_out(pos_ids)                    # (batch, dim)
            n = self.W_out(neg_ids)                    # (batch, K, dim)
            pos_score = (c * p).sum(-1)                # (batch,)
            neg_score = (c.unsqueeze(1) * n).sum(-1)   # (batch, K)
            # Sum over the K negatives per example, then average over the batch.
            pos_loss = -F.logsigmoid(pos_score)            # (batch,)
            neg_loss = -F.logsigmoid(-neg_score).sum(-1)   # (batch,)
            return (pos_loss + neg_loss).mean()
Key Sources
- word2vec-efficient-estimation-word-representations — the foundational paper
- bert-pre-training-of-deep-bidirectional-transformers-for-language-understanding — BERT contextualizes the static word embeddings
- sentence-bert-siamese-bert-networks — SBERT extends the recipe to whole sentences
Related Concepts
- distributional-hypothesis — the linguistic foundation
- negative-sampling — the key training trick
- self-supervised-learning — predict-the-context is the canonical SSL pattern
- sentence-embeddings — the sentence-level extension
- tokenization — affects vocabulary; modern systems use BPE instead of word-level
Open Questions
- Polysemy: word2vec gives one vector per word, so the senses of “bank” (river bank, financial bank) collapse into a single point. ELMo and BERT address this with contextual embeddings, but at much higher inference cost.
- Multilingual alignment: how to learn embeddings such that the same word in two languages maps to similar vectors? Approaches have progressed from bilingual word embeddings to multilingual BERT and then to cross-lingual contrastive learning.
- Subword: FastText extends word2vec to handle morphologically rich languages and out-of-vocabulary (OOV) words via character n-grams. Subword tokenization such as BPE is the modern alternative and is largely language-agnostic.
- Bias: word embeddings absorb whatever biases are in their training corpus (man:doctor :: woman:nurse). Mitigation is an active research area.
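The character n-gram idea from the subword bullet above can be sketched briefly. This is a simplified illustration, not FastText's implementation: the function name `char_ngrams` is hypothetical, and the sketch omits the whole-word token and the hashing of n-grams into buckets that FastText also uses.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Return the character n-grams of `word` wrapped in boundary markers."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    return grams

trigrams = char_ngrams("where", n_min=3, n_max=3)
print(trigrams)  # ['<wh', 'whe', 'her', 'ere', 're>']

# An unseen word still shares n-grams with words seen during training.
shared = set(char_ngrams("kitten")) & set(char_ngrams("kittens"))
print(len(shared) > 0)  # True
```

Because a word's vector is built from the vectors of its n-grams, an OOV form such as “kittens” inherits most of its representation from the n-grams it shares with “kitten.”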