The Problem
You observe a sequence of measurements over time, but you do not directly see the underlying state that produced them. A speech signal is the acoustic measurements; the underlying state is which phoneme is being uttered. A GPS trace is the measured positions; the underlying state is which road segment the vehicle is on. A protein sequence is the amino acids; the state is the secondary structure. Without a model of how the hidden states evolve and how they emit observations, you have no principled way to recover the state sequence from observations.
Before HMMs (and before Baum-Welch / Viterbi), engineers used heuristics: nearest-neighbor matching, template matching, manual rules. These broke whenever observations got noisy, sparse, or ambiguous.
The Key Insight
A Hidden Markov Model factors the problem into two probability families:
- Transition probabilities P(state at t+1 | state at t). The hidden state evolves as a Markov chain: only the current state matters for predicting the next.
- Emission probabilities P(observation at t | state at t). Each state emits observations with some characteristic distribution.
Given the two probability tables and a sequence of observations, you can compute the most likely sequence of hidden states using Viterbi (dynamic programming), or update beliefs in real time using the forward algorithm.
Mechanism in Plain English
- Define a finite (or countable) set of hidden states.
- Specify how states transition: a matrix A where A[i,j] = P(next state = j | current state = i).
- Specify how states emit observations: for each state i, an emission distribution P(obs | state i). For discrete observations this is a matrix; for continuous, a Gaussian or other parametric form.
- Specify the initial distribution: pi[i] = P(starting in state i).
- Given a sequence of observations o_1, o_2, …, o_T, run Viterbi to find the most likely state sequence, or run forward-backward to compute marginal posteriors P(state at t | all observations).
Math with Translation
The joint probability of a state sequence s = (s_1, …, s_T) and observations o = (o_1, …, o_T):
- pi(s_1) — probability of starting in state s_1
- A(s_{t-1}, s_t) — transition probability from s_{t-1} to s_t
- B(s_t, o_t) — emission probability of observation o_t given state s_t
The Viterbi recursion finds argmax_s P(s, o) in O(T * N^2) time where N is the number of states:
- delta_t(j) — score of the best path ending in state j at time t
- The argmax_i is stored as a backpointer for reconstruction.
Concrete Walkthrough
Two-state weather model. States: {rainy, sunny}. Observations: {umbrella, no-umbrella}.
Transition matrix A:
rainy sunny
rainy 0.7 0.3
sunny 0.4 0.6
Emission matrix B:
umbrella no-umbrella
rainy 0.9 0.1
sunny 0.2 0.8
Initial: pi(rainy) = 0.5, pi(sunny) = 0.5.
Observations: [umbrella, umbrella, no-umbrella]
Viterbi:
delta_1(rainy) = 0.5 * 0.9 = 0.45
delta_1(sunny) = 0.5 * 0.2 = 0.10
delta_2(rainy) = max(0.45 * 0.7, 0.10 * 0.4) * 0.9 = 0.45 * 0.7 * 0.9 = 0.2835
delta_2(sunny) = max(0.45 * 0.3, 0.10 * 0.6) * 0.2 = 0.45 * 0.3 * 0.2 = 0.027
delta_3(rainy) = max(0.2835 * 0.7, 0.027 * 0.4) * 0.1 = 0.0198
delta_3(sunny) = max(0.2835 * 0.3, 0.027 * 0.6) * 0.8 = 0.068
Argmax at t=3: sunny.
Backtrack: at t=2, rainy was the predecessor. At t=1, rainy.
Most likely state sequence: rainy, rainy, sunny.
What’s Clever
The Markov assumption (only the current state matters) is what makes HMMs tractable. Without it, the joint distribution would have an exponentially growing dependency structure. With it, dynamic programming gives exact inference in linear time.
The factoring of “how states evolve” from “how states emit” is the second clever move. Each can be learned, set, or modified independently. You can keep the transition model fixed (network topology of roads, language model of phonemes) and only learn the emission model from data.
Key Sources
- hidden-markov-map-matching-noise-sparseness — application to GPS map matching
Related Concepts
- map-matching — primary application area in this wiki
- viterbi — the dynamic programming algorithm for max-likelihood state recovery