The Problem

Before word2vec (2013), NLP systems commonly represented words as one-hot vectors: a 50,000-dimensional vector with a single 1 and zeros everywhere else. This is fine for indexing but useless for similarity: “cat” and “kitten” have a cosine similarity of zero, the same as “cat” and “kerosene.” Worse, because one-hots share no structure, a model has to learn the relationship between every pair of words separately, an N² problem. Dense word representations already existed (for example LSA, latent semantic analysis), but they were either too slow to train at scale or did not capture rich semantic relationships. The problem: how to assign every word a low-dimensional dense vector that captures its meaning, learnable from raw text alone.
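
To make the similarity problem concrete, the sketch below builds one-hot vectors for a hypothetical three-word vocabulary (the words and their IDs are illustrative, not taken from any real dataset) and checks that any two distinct words have cosine similarity zero.

```python
import math

def one_hot(index, size):
    """Return a list of zeros with a single 1.0 at `index`."""
    vec = [0.0] * size
    vec[index] = 1.0
    return vec

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy vocabulary; a real one would have ~50,000 entries.
vocab = {"cat": 0, "kitten": 1, "kerosene": 2}
vectors = {word: one_hot(idx, len(vocab)) for word, idx in vocab.items()}

sim_cat_kitten = cosine(vectors["cat"], vectors["kitten"])
sim_cat_kerosene = cosine(vectors["cat"], vectors["kerosene"])
print(sim_cat_kitten, sim_cat_kerosene)  # both are 0.0
```

Whatever the vocabulary size, distinct one-hot vectors are orthogonal, so the representation carries no notion of relatedness.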

The Key Insight

You shall know a word by the company it keeps (Firth, 1957). Train a model whose objective is to predict surrounding words from a center word (or vice versa). The model’s hidden representations of words — what it’s forced to compute to do this prediction — will end up encoding the words’ co-occurrence patterns. Words that appear in similar contexts (cat, kitten, dog) will get similar vectors. This is the distributional hypothesis, made operational.

Mechanism in Plain English

  1. Tokenize a large text corpus into a sequence of word IDs.
  2. Define a prediction task: given a center word, predict the surrounding words within a window (Skip-gram); or given the surrounding words, predict the center (CBOW).
  3. Build a model: two embedding tables, W_in (input/center) and W_out (output/context). Each table has one row per vocabulary word, and each row is a d-dimensional vector.
  4. For a training example (center, context), the score is the dot product W_in[center] · W_out[context]. Train to maximize this for true pairs and minimize it for sampled negatives (negative sampling).
  5. After training, W_in is the matrix of word embeddings.
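
Steps 1 and 2 can be sketched as follows. This is a minimal illustration, assuming a fixed symmetric window; the helper name `skipgram_pairs` and the toy sentence-level vocabulary are hypothetical, and refinements used by the original tool (such as randomly shrinking the window or subsampling frequent words) are left out.

```python
def skipgram_pairs(token_ids, window):
    """Return (center, context) pairs for every position in `token_ids`.

    `window` is the number of words taken on each side of the center.
    """
    pairs = []
    for i, center in enumerate(token_ids):
        lo = max(0, i - window)
        hi = min(len(token_ids), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((center, token_ids[j]))
    return pairs

sentence = "the quick brown fox jumps over".split()
# Assign IDs in order of first appearance (toy vocabulary).
vocab = {word: idx for idx, word in enumerate(dict.fromkeys(sentence))}
ids = [vocab[w] for w in sentence]

pairs = skipgram_pairs(ids, window=2)
fox = vocab["fox"]
fox_contexts = [ctx for center, ctx in pairs if center == fox]
print(fox_contexts)  # [1, 2, 4, 5] -> quick, brown, jumps, over
```

For CBOW the same windows are used, but the context words on both sides are combined to predict the center word rather than the other way round.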

ASCII Diagram

SKIP-GRAM (Word2Vec):

  Sentence: "the quick brown FOX jumps over"
                              |
                       center word (FOX)
                              |
                       look up W_in[FOX]
                              |
                              v
                       300-d vector
                              |
                              v
                  for each context word w in window:
                      score = vector @ W_out[w]
                      softmax over vocabulary
                      train to make true context words high


  After training, W_in rows are the word embeddings.
  Words with similar contexts get similar vectors.
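
The scoring path in the diagram can be written out directly. The sketch below is illustrative only: the tables are tiny and randomly initialized (untrained), and the sizes and the helper name `context_probabilities` are assumptions made for the example. It computes the full softmax shown in the diagram, which is the expensive step that negative sampling avoids during training.

```python
import math
import random

random.seed(0)
VOCAB_SIZE, DIM = 6, 4  # toy sizes; the diagram uses 300-d vectors

def random_table(rows, cols):
    """Small random embedding table, one row per vocabulary word."""
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

W_in = random_table(VOCAB_SIZE, DIM)   # center-word embeddings
W_out = random_table(VOCAB_SIZE, DIM)  # context-word embeddings

def context_probabilities(center_id):
    """Softmax over the vocabulary of the scores W_in[center] . W_out[w]."""
    v = W_in[center_id]
    scores = [sum(a * b for a, b in zip(v, row)) for row in W_out]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = context_probabilities(center_id=3)
print(len(probs), round(sum(probs), 6))  # 6 1.0
```

Note that one prediction touches every row of W_out, so the cost per training example grows linearly with the vocabulary size.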

Math with Translation

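Skip-gram with negative sampling maximizes, for each training pair:

  $$J = \log \sigma(u_o^\top v_c) + \sum_{k=1}^{K} \mathbb{E}_{w_k \sim P_n(w)} \left[ \log \sigma(-u_{w_k}^\top v_c) \right]$$

  • $c$ = center word, $o$ = observed (true) context word; $v_c$ is the row W_in[c] and $u_o$ is the row W_out[o].
  • $\sigma$ = sigmoid.
  • The first term: maximize the score for the true (center, context) pair.
  • The second term: minimize the score for $K$ random negatives (typically $K$ between 5 and 20), where negatives are sampled from $P_n(w) \propto U(w)^{3/4}$, the unigram distribution $U$ raised to the 3/4 power (this slightly upweights rare words relative to their raw frequency).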
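
The effect of the 3/4 power on the negative-sampling distribution can be checked numerically. The word counts below are hypothetical and the helper name `noise_distribution` is an assumption for this sketch; only the exponent comes from the text above.

```python
import random
from collections import Counter

def noise_distribution(counts, power=0.75):
    """Normalize count**power into a probability distribution over words."""
    weights = {w: c ** power for w, c in counts.items()}
    total = sum(weights.values())
    return {w: x / total for w, x in weights.items()}

# Hypothetical corpus counts: one very frequent word, two rarer ones.
counts = Counter({"the": 1000, "cat": 10, "kerosene": 1})
total_count = sum(counts.values())
unigram = {w: c / total_count for w, c in counts.items()}
noise = noise_distribution(counts)

for word in counts:
    print(f"{word:10s} unigram={unigram[word]:.4f} noise={noise[word]:.4f}")

# Draw 5 negatives for one training pair from the smoothed distribution.
rng = random.Random(0)
words = list(noise)
negatives = rng.choices(words, weights=[noise[w] for w in words], k=5)
```

In this toy example the frequent word still dominates, but the rare words receive several times their raw unigram probability.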

The famous geometric property:

  $$\mathrm{vec}(\text{king}) - \mathrm{vec}(\text{man}) + \mathrm{vec}(\text{woman}) \approx \mathrm{vec}(\text{queen})$$

The model is never trained on this relation explicitly, but the direction in which “king” differs from “queen” is roughly parallel to the direction in which “man” differs from “woman” (both contrasts encode gender), so subtraction-and-addition lands you nearby. This emergent linear structure is what made word2vec famous.
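
The king/queen arithmetic can be illustrated with hand-built vectors. The 2-d vectors below are constructed by hand for the example (axis 0 loosely "royalty", axis 1 loosely "gender") and are not trained embeddings; the helper name `analogy` is hypothetical. Excluding the three query words from the candidates follows common practice for this kind of evaluation.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def analogy(a, b, c, vectors):
    """Return the word whose vector is closest to vec(a) - vec(b) + vec(c)."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in {a, b, c}]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

# Hand-constructed toy vectors, NOT learned from data.
toy_vectors = {
    "king":  [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man":   [0.0, 1.0],
    "woman": [0.0, -1.0],
    "apple": [-1.0, 0.0],
}

result = analogy("king", "man", "woman", toy_vectors)
print(result)  # queen
```

With real embeddings the match is only approximate, which is why the result is found by nearest-neighbor search rather than exact equality.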

Concrete Walkthrough

SETUP: vocabulary = 50,000 words; embedding dim = 300.
       Corpus: 1.6 billion words from Google News.

TRAINING:
  For each window of 5 words: 
    Take the center word; pair it with each of 4 context words.
    For each pair: 1 positive + 5 negative samples.
    Update W_in[center] and W_out[positive], W_out[negatives] via gradient descent.
  
  Total pairs: ~1.6B * 4 = 6.4B positive (center, context) pairs.
  Total pair updates: ~6.4B * 6 = ~38B (1 positive + 5 negatives each).
  
TIME: ~1 day on a single multi-core CPU.

OUTPUT: W_in is 50,000 x 300 = 15M parameters. The embeddings.

EXAMPLE LOOKUPS (cosine similarity):
  most_similar(W_in[king]):       queen, prince, monarch, kingdom, throne
  most_similar(W_in[paris]):      france, london, capital, european, city
  most_similar(W_in[programming]): coding, software, code, programming-language

ANALOGY:
  vec_paris - vec_france + vec_germany = ?
  Search nearest neighbors: top result = vec_berlin
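
The counts in the walkthrough follow from simple arithmetic, reproduced here as a quick check. The variable names are illustrative; the numbers are the ones stated in the setup above.

```python
corpus_tokens = 1_600_000_000   # 1.6B words
context_per_center = 4          # window of 5 words -> 4 context words
negatives_per_pair = 5
vocab_size, dim = 50_000, 300

positive_pairs = corpus_tokens * context_per_center
pair_updates = positive_pairs * (1 + negatives_per_pair)
params_per_table = vocab_size * dim  # W_in alone; W_out is the same size

print(f"{positive_pairs:,}")    # 6,400,000,000
print(f"{pair_updates:,}")      # 38,400,000,000
print(f"{params_per_table:,}")  # 15,000,000
```

During training both tables are held in memory, so the model has twice the parameter count of the final embedding matrix.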

What’s Clever

The first clever recognition: the hidden layer in earlier neural language models was unnecessary for the embedding objective. Bengio’s 2003 NNLM had a tanh hidden layer; the bulk of compute went there. Word2Vec removes it entirely — just two embedding tables and a dot product. Surprisingly, the resulting embeddings are better, not worse, because the simpler model trains on more data faster.

The second clever move: negative sampling as a cheap replacement for the full softmax. Computing a softmax over a 50K vocabulary at every training step is too slow. Negative sampling replaces it with a binary classification per example: the true (center, context) pair versus a handful of randomly sampled negatives. Under certain assumptions, optimizing this objective is equivalent to implicitly factorizing a word-context matrix of PMI (pointwise mutual information) values shifted by the log of the number of negatives (Levy and Goldberg, 2014), so the vectors still encode the co-occurrence statistics a full softmax would be fit to, at a fraction of the cost.
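
The shifted-PMI quantity mentioned above can be computed from co-occurrence counts. The sketch below assumes the form analyzed by Levy and Goldberg (2014), PMI minus the log of the number of negatives; the counts are hypothetical and the function name `shifted_pmi` is an assumption for the example.

```python
import math

def shifted_pmi(pair_count, word_count, context_count, total_pairs, num_negatives):
    """PMI(word, context) - log(num_negatives), computed from raw counts.

    PMI = log( P(word, context) / (P(word) * P(context)) ).
    """
    pmi = math.log(pair_count * total_pairs / (word_count * context_count))
    return pmi - math.log(num_negatives)

# Hypothetical counts: the pair occurs 10x more often than chance predicts.
value = shifted_pmi(
    pair_count=20, word_count=100, context_count=200,
    total_pairs=10_000, num_negatives=5,
)
print(value)  # log(10) - log(5) = log(2), about 0.693
```

Under that analysis, the dot product of a trained center vector and context vector tends toward this value for the corresponding word pair.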

The third (most foundational) recognition: distributional semantics actually works at scale. The Firth hypothesis, that meaning is captured by co-occurrence, was a roughly 60-year-old linguistic conjecture that had not been operationalized convincingly at this scale. Word2Vec made it concrete: a billion-word corpus + a trivial neural net + about a day of training = vectors with rich semantic structure. This helped unlock the next decade of representation learning.

Code

# Conceptual minimal skip-gram with negative sampling

import torch
import torch.nn as nn
import torch.nn.functional as F

class SkipGram(nn.Module):
    def __init__(self, vocab_size, dim):
        super().__init__()
        self.W_in  = nn.Embedding(vocab_size, dim)  # center embeddings
        self.W_out = nn.Embedding(vocab_size, dim)  # context embeddings

    def forward(self, center_ids, pos_ids, neg_ids):
        """Return the negative-sampling loss averaged over the batch."""
        # center_ids: (batch,)   pos_ids: (batch,)   neg_ids: (batch, K)
        c = self.W_in(center_ids)               # (batch, dim)
        p = self.W_out(pos_ids)                 # (batch, dim)
        n = self.W_out(neg_ids)                 # (batch, K, dim)

        pos_score = (c * p).sum(-1)               # (batch,)
        neg_score = (c.unsqueeze(1) * n).sum(-1)  # (batch, K)

        # Sum over the K negatives of each example (as in the
        # negative-sampling objective), then average over the batch.
        pos_loss = -F.logsigmoid(pos_score)           # (batch,)
        neg_loss = -F.logsigmoid(-neg_score).sum(-1)  # (batch,)
        return (pos_loss + neg_loss).mean()
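
The module above relies on autograd for its updates. For intuition, one update for a single (center, positive, negatives) example can also be written by hand. This is a pure-Python sketch with made-up 2-d vectors and an arbitrary learning rate; the helper names are hypothetical, and it is not how the original tool is implemented.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def sgns_loss(c, p, negs):
    """-log sigmoid(c.p) - sum_k log sigmoid(-c.n_k) for one training pair."""
    return -math.log(sigmoid(dot(c, p))) - sum(
        math.log(sigmoid(-dot(c, n))) for n in negs
    )

def sgd_step(c, p, negs, lr):
    """Return updated copies of (c, p, negs) after one gradient step."""
    g_pos = sigmoid(dot(c, p)) - 1.0             # d loss / d (c.p)
    g_negs = [sigmoid(dot(c, n)) for n in negs]  # d loss / d (c.n_k)
    # All gradients are computed from the old vectors before any update.
    grad_c = [
        g_pos * p[i] + sum(g * n[i] for g, n in zip(g_negs, negs))
        for i in range(len(c))
    ]
    new_c = [ci - lr * gi for ci, gi in zip(c, grad_c)]
    new_p = [pi - lr * g_pos * ci for pi, ci in zip(p, c)]
    new_negs = [
        [ni - lr * g * ci for ni, ci in zip(n, c)]
        for g, n in zip(g_negs, negs)
    ]
    return new_c, new_p, new_negs

# Made-up vectors: one center, one positive context, one negative.
center, positive, negatives = [0.1, 0.2], [0.3, -0.1], [[-0.2, 0.4]]
loss_before = sgns_loss(center, positive, negatives)
center, positive, negatives = sgd_step(center, positive, negatives, lr=0.1)
loss_after = sgns_loss(center, positive, negatives)
print(loss_before > loss_after)  # True
```

Only the rows for the center word, the positive word, and the K sampled negatives change in each step, which is what keeps the per-example cost independent of the vocabulary size.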

Open Questions

  • Polysemy: word2vec gives one vector per word, so the senses of “bank” (river bank, financial bank) collapse into a single vector. ELMo and BERT address this with contextual embeddings, but at much higher inference cost.
  • Multilingual alignment: how can embeddings be learned so that translation-equivalent words in two languages map to similar vectors? Approaches include bilingual word embeddings, multilingual BERT, and cross-lingual contrastive learning.
  • Subword: FastText extends word2vec to handle morphologically rich languages and out-of-vocabulary (OOV) words via character n-grams. Subword tokenization such as BPE is the common modern replacement.
  • Bias: word embeddings absorb whatever biases are in their training corpus (man:doctor :: woman:nurse). Mitigation is an active research area.
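
For the subword item above, the character n-gram idea can be sketched in a few lines. The boundary markers `<` and `>` and the 3 to 6 length range follow the FastText paper's description; the function name `char_ngrams` is hypothetical, and this is not the library's own code.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams of `word` wrapped in '<' and '>' boundary markers."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    return grams

trigrams = char_ngrams("where", n_min=3, n_max=3)
print(trigrams)  # ['<wh', 'whe', 'her', 'ere', 're>']
```

A word's vector is then built from the vectors of its n-grams, so an unseen word still gets a representation from the pieces it shares with known words.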