The Problem

What does a word “mean,” operationally? Dictionary definitions are circular (defined in terms of other words). Truth-conditional semantics is hard to compute. Symbolic semantic networks (WordNet) are hand-built and incomplete. We need a notion of word meaning that is (a) usable by a computer and (b) automatically learnable from raw text.

The Key Insight

J.R. Firth (1957): “You shall know a word by the company it keeps.” Or more precisely: the meaning of a word is determined by the distribution of words it appears with. Two words with similar distributions have similar meanings. This shifts semantics from a question of definition to a question of co-occurrence statistics.

This is the distributional hypothesis. It is a strong claim, and the success of word2vec, BERT, and modern LLMs is evidence that it is largely correct, at least for the aspects of meaning that NLP tasks depend on.

Mechanism in Plain English

  1. Take a large corpus of text.
  2. For each word, count which other words appear within a context window (e.g., 5 words).
  3. Treat two words as “similar in meaning” to the extent that their context-word distributions are similar.
  4. Operationalize: build vectors where each dimension is a context word, the value is the co-occurrence count (or a transformation thereof). Words with similar context distributions have similar vectors.

This was the LSA / count-based pipeline before 2013. Word2Vec replaces the explicit count matrix with an implicit one: it trains a model to predict contexts directly. Levy & Goldberg (2014) showed that skip-gram with negative sampling implicitly factorizes a shifted PMI matrix, so the resulting embeddings are closely related to the count-based ones rather than a different kind of object.
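
Steps 1 to 4 above can be sketched in a few lines. This is a minimal illustration, not a production pipeline: the function names `cooccurrence_vectors` and `cosine`, the whitespace tokenizer, and the three-sentence corpus are all assumptions made up for the example.

```python
from collections import Counter, defaultdict
import math

def cooccurrence_vectors(sentences, window=5):
    """Map each word to a Counter of the words seen within `window` positions of it."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()  # naive whitespace tokenizer (assumption)
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse count vectors stored as Counters."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Toy corpus, invented for illustration only.
corpus = [
    "the cat drinks milk",
    "the kitten drinks milk",
    "the car needs fuel",
]
vecs = cooccurrence_vectors(corpus, window=2)
print(cosine(vecs["cat"], vecs["kitten"]))  # identical contexts, so close to 1.0
print(cosine(vecs["cat"], vecs["car"]))     # only "the" is shared, so about 0.33
```

Note that "cat" and "kitten" never appear together in this corpus; they come out similar only because they keep the same company.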

ASCII Diagram

COUNT-BASED VIEW (LSA-style):

  Co-occurrence counts in 5-word window:
                    quick  brown  fast  black   ate    ...
  fox             [   42     58    19     3     12    ...]
  dog             [    3     12    45    87     8     ...]
  lion            [    2      5    12    65    34     ...]

  Cosine similarity over the count vectors:
    sim(fox, dog) = 0.31
    sim(dog, lion) = 0.79  <- both eat, both are commonly black, etc.
    sim(fox, lion) = 0.22

NEURAL VIEW (Word2Vec):

  No explicit count matrix. Instead, train a model whose hidden states
  are forced to encode the same information through a prediction task.
  The resulting vectors capture the same distributional structure.
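
The cosine values in the count-based view can be recomputed from the columns that are visible in the diagram. The sketch below uses only those five columns; the elided "..." columns are unknown, so the results are not expected to match the diagram's figures exactly.

```python
import math

# Only the five visible columns (quick, brown, fast, black, ate).
# The "..." columns from the diagram are unknown and therefore omitted.
counts = {
    "fox":  [42, 58, 19, 3, 12],
    "dog":  [3, 12, 45, 87, 8],
    "lion": [2, 5, 12, 65, 34],
}

def cosine(u, v):
    """Cosine similarity between two equal-length dense count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

for a, b in [("fox", "dog"), ("dog", "lion"), ("fox", "lion")]:
    print(f"sim({a}, {b}) = {cosine(counts[a], counts[b]):.2f}")
```

On the visible columns this gives roughly 0.27, 0.89, and 0.22: the same ordering as in the diagram, with dog and lion clearly closest.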

Math with Translation

The PMI-shifted matrix interpretation (Levy & Goldberg, 2014):

Word2Vec’s skip-gram with negative sampling implicitly factorizes the matrix:

$$M_{wc} = \vec{w} \cdot \vec{c} = \log \frac{P(w, c)}{P(w)\,P(c)} - \log k$$

Where:

  • $P(w, c)$ = probability of seeing word $w$ with context $c$.
  • $P(w)$, $P(c)$ = marginal probabilities.
  • $k$ = number of negative samples.
  • The first term is pointwise mutual information (PMI), a classic distributional similarity measure.

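So word2vec's skip-gram with negative sampling can be read as factorizing a shifted PMI matrix. The equivalence is exact only at the optimum of the objective and with enough embedding dimensions; in practice it holds approximately. On this reading, the geometric structure of the embeddings, including the famous analogy property, comes largely from the structure of this matrix, not from anything special about neural nets.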
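
To make the matrix concrete, here is a minimal sketch that computes the shifted PMI cell, PMI(w, c) minus log k, from raw co-occurrence counts. It assumes probabilities are estimated by relative frequency; the function name `shifted_pmi` and the counts are made up for illustration, and the factorization step itself is not shown.

```python
import math

def shifted_pmi(counts, k=5):
    """Return {(w, c): PMI(w, c) - log k} for every observed (word, context) pair.

    `counts` maps (word, context) pairs to co-occurrence counts.
    Unobserved pairs would give log(0) = -inf and are simply left out here.
    """
    total = sum(counts.values())
    word_totals, ctx_totals = {}, {}
    for (w, c), n in counts.items():
        word_totals[w] = word_totals.get(w, 0) + n
        ctx_totals[c] = ctx_totals.get(c, 0) + n
    cells = {}
    for (w, c), n in counts.items():
        p_wc = n / total               # joint probability P(w, c)
        p_w = word_totals[w] / total   # marginal P(w)
        p_c = ctx_totals[c] / total    # marginal P(c)
        cells[(w, c)] = math.log(p_wc / (p_w * p_c)) - math.log(k)
    return cells

# Made-up counts for illustration.
toy_counts = {
    ("fox", "quick"): 8, ("fox", "the"): 2,
    ("dog", "lazy"): 6, ("dog", "the"): 4,
}
for pair, value in shifted_pmi(toy_counts, k=1).items():
    print(pair, round(value, 3))
```

With k = 1 the shift vanishes and the cells are plain PMI: pairs that co-occur more often than chance (fox, quick) come out positive, and pairs that co-occur less often than chance (fox, the) come out negative. A larger k lowers every cell by the same constant.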

Concrete Walkthrough

SMALL CORPUS:
  "The quick brown fox jumps over the lazy dog"
  "A swift brown fox darts past the lazy dog"
  "The fast red fox leaps over the sleeping dog"

CONTEXT WINDOW = 2:
  Fox co-occurs with: quick, brown, jumps, over, swift, brown, darts,
                      past, fast, red, leaps, over
  Dog co-occurs with: lazy, lazy, sleeping, the (3 times)
  Cat (not in corpus, but if it were):
                      mouse, milk, sleeping, the, ...

DISTRIBUTIONAL SIMILARITY:
  Fox and Dog share no context words at window 2: every "the" sits
  3 positions away from "fox". At window 3 they share only "the", "over",
  and "past", which carry little animal-specific signal.
  Fox and Cat (in a larger corpus) would plausibly share: "swift", "leaps", "small".

  -> Fox and Cat can come out MORE similar than Fox and Dog by some measures.

(In embeddings trained on large corpora, fox-cat similarity is plausibly high, reflecting shared predator+small+furred contexts.)
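
Window arithmetic is easy to get wrong by hand, so the walkthrough can be checked with a short script. This is a sketch under simple assumptions (lowercasing, whitespace tokenization, a symmetric window); the helper name `contexts` is made up for the example.

```python
from collections import Counter

corpus = [
    "The quick brown fox jumps over the lazy dog",
    "A swift brown fox darts past the lazy dog",
    "The fast red fox leaps over the sleeping dog",
]

def contexts(target, sentences, window):
    """Count the words within `window` positions of each occurrence of `target`."""
    seen = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, token in enumerate(tokens):
            if token == target:
                left = tokens[max(0, i - window):i]
                right = tokens[i + 1:i + 1 + window]
                seen.update(left + right)
    return seen

for window in (2, 3):
    fox = contexts("fox", corpus, window)
    dog = contexts("dog", corpus, window)
    shared = sorted(fox.keys() & dog.keys())
    print(f"window={window}")
    print("  fox:", dict(fox))
    print("  dog:", dict(dog))
    print("  shared:", shared)
```

At window 2 the shared list is empty, because "the" is always three positions away from "fox". At window 3 the two words share "over", "past", and "the", which shows how sensitive a tiny corpus is to the window size.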

What’s Clever

The clever recognition: meaning is not a property of the word itself, but a property of where the word lives in linguistic space. This sidesteps the philosophical question of “what is meaning?” by replacing it with an empirically tractable one: “what are the statistical patterns of usage?”

The second clever recognition: the hypothesis composes. Words that appear in similar contexts have similar meanings (word level). Sentences that appear in similar contexts have similar meanings (sentence level). Documents that appear in similar contexts have similar meanings (document level). The same idea applied at different granularities gives word embeddings, sentence embeddings, document embeddings.

The third recognition: the hypothesis applies beyond text. Co-occurrence patterns work for any sequential or graph-structured data: amino-acid n-grams in protein sequences (ProtVec), substructures in molecules (Mol2vec), genes in co-expression data (Gene2Vec), products in shopping carts (Item2Vec), POIs in trajectories. Whenever you can define “what appears nearby,” you can apply the distributional hypothesis.
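
As a small illustration of the non-text case, the sketch below treats a shopping basket as the context: two products are "nearby" if they appear in the same basket. The baskets are invented, and this is a plain count-based version of the idea, not the Item2Vec algorithm itself (which trains skip-gram with negative sampling).

```python
from collections import Counter, defaultdict
from itertools import permutations
import math

# Hypothetical baskets, invented for illustration.
baskets = [
    ["pasta", "tomato sauce", "parmesan"],
    ["spaghetti", "tomato sauce", "parmesan"],
    ["shampoo", "conditioner"],
]

# The "context" of an item is every other item in the same basket.
item_contexts = defaultdict(Counter)
for basket in baskets:
    for item, other in permutations(basket, 2):
        item_contexts[item][other] += 1

def cosine(u, v):
    """Cosine similarity between two sparse count vectors stored as Counters."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

print(cosine(item_contexts["pasta"], item_contexts["spaghetti"]))  # close to 1.0
print(cosine(item_contexts["pasta"], item_contexts["shampoo"]))    # 0.0
```

Pasta and spaghetti are never bought together here, yet they come out as near-identical because they are bought with the same things, which is the distributional hypothesis restated for products.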

Open Questions

  • Where does the hypothesis break? Known weaknesses include function words (the, of), antonyms (good vs bad: they appear in similar contexts but mean opposite things), and rare words (insufficient context).
  • Compositional meaning: static distributional embeddings give each word a single vector, which captures the word in isolation but not how words combine. “Hot dog” is not “hot” + “dog.” Modern transformers address this by contextualizing each token via attention, but the underlying training signal is still distributional.
  • Causal vs distributional: knowing that “mosquito” co-occurs with “malaria” doesn’t tell you mosquitoes cause malaria. The hypothesis is a similarity claim, not a causal one.
  • Beyond linguistic data: how far does the hypothesis go? It seems to work everywhere we can define a context window — but with what limits?
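
The antonym weakness from the first bullet is easy to reproduce. In the made-up sentences below, "good" and "bad" fill exactly the same slots, so a purely distributional view cannot tell them apart; the corpus and the helper name `contexts` are assumptions for the example.

```python
from collections import Counter

# Made-up sentences in which "good" and "bad" occupy the same positions.
corpus = [
    "the food was good today",
    "the food was bad today",
    "the movie was good overall",
    "the movie was bad overall",
]

def contexts(target, sentences, window=2):
    """Count the words within `window` positions of each occurrence of `target`."""
    seen = Counter()
    for sentence in sentences:
        tokens = sentence.split()
        for i, token in enumerate(tokens):
            if token == target:
                seen.update(tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window])
    return seen

good, bad = contexts("good", corpus), contexts("bad", corpus)
print(dict(good))
print(dict(bad))
print("identical context distributions:", good == bad)
```

Real corpora are less extreme, but the effect is the same in kind: antonyms tend to land close together because they are used in the same frames.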