Hidden Markov Models

The Problem

You observe a sequence of measurements over time, but you do not directly see the underlying state that produced them. A speech signal is the acoustic measurements; the underlying state is which phoneme is being uttered. A GPS trace is the measured positions; the underlying state is which road segment the vehicle is on. A protein sequence is the amino acids; the state is the secondary structure. Without a model of how the hidden states evolve and how they emit observations, you have no principled way to recover the state sequence from observations.

Before HMMs (and before Baum-Welch / Viterbi), engineers used heuristics: nearest-neighbor matching, template matching, manual rules. These broke whenever observations got noisy, sparse, or ambiguous.

The Key Insight

A Hidden Markov Model factors the problem into two probability families:

Transition probabilities P(state at t+1 | state at t). The hidden state evolves as a Markov chain: only the current state matters for predicting the next.
Emission probabilities P(observation at t | state at t). Each state emits observations with some characteristic distribution.

Given the two probability tables and a sequence of observations, you can compute the most likely sequence of hidden states using Viterbi (dynamic programming), or update beliefs in real time using the forward algorithm.

Mechanism in Plain English

Define a finite (or countable) set of hidden states.
Specify how states transition: a matrix A where A[i,j] = P(next state = j | current state = i).
Specify how states emit observations: for each state i, an emission distribution P(obs | state i). For discrete observations this is a matrix; for continuous, a Gaussian or other parametric form.
Specify the initial distribution: pi[i] = P(starting in state i).
Given a sequence of observations o_1, o_2, …, o_T, run Viterbi to find the most likely state sequence, or run forward-backward to compute marginal posteriors P(state at t | all observations).

Math with Translation

The joint probability of a state sequence s = (s_1, …, s_T) and observations o = (o_1, …, o_T):

$P (s, o) = π (s_{1}) \cdot \prod_{t = 2}^{T} A (s_{t - 1}, s_{t}) \cdot \prod_{t = 1}^{T} B (s_{t}, o_{t})$

pi(s_1) — probability of starting in state s_1
A(s_{t-1}, s_t) — transition probability from s_{t-1} to s_t
B(s_t, o_t) — emission probability of observation o_t given state s_t

The Viterbi recursion finds argmax_s P(s, o) in O(T * N^2) time where N is the number of states:

$δ_{t} (j) = max_{i} [δ_{t - 1} (i) \cdot A (i, j)] \cdot B (j, o_{t})$

delta_t(j) — score of the best path ending in state j at time t
The argmax_i is stored as a backpointer for reconstruction.

Concrete Walkthrough

Two-state weather model. States: {rainy, sunny}. Observations: {umbrella, no-umbrella}.

Transition matrix A:
              rainy   sunny
  rainy       0.7     0.3
  sunny       0.4     0.6

Emission matrix B:
              umbrella  no-umbrella
  rainy       0.9       0.1
  sunny       0.2       0.8

Initial: pi(rainy) = 0.5, pi(sunny) = 0.5.

Observations: [umbrella, umbrella, no-umbrella]

Viterbi:
  delta_1(rainy) = 0.5 * 0.9 = 0.45
  delta_1(sunny) = 0.5 * 0.2 = 0.10

  delta_2(rainy) = max(0.45 * 0.7, 0.10 * 0.4) * 0.9 = 0.45 * 0.7 * 0.9 = 0.2835
  delta_2(sunny) = max(0.45 * 0.3, 0.10 * 0.6) * 0.2 = 0.45 * 0.3 * 0.2 = 0.027

  delta_3(rainy) = max(0.2835 * 0.7, 0.027 * 0.4) * 0.1 = 0.0198
  delta_3(sunny) = max(0.2835 * 0.3, 0.027 * 0.6) * 0.8 = 0.068

  Argmax at t=3: sunny.
  Backtrack: at t=2, rainy was the predecessor. At t=1, rainy.
  Most likely state sequence: rainy, rainy, sunny.

What’s Clever

The Markov assumption (only the current state matters) is what makes HMMs tractable. Without it, the joint distribution would have an exponentially growing dependency structure. With it, dynamic programming gives exact inference in linear time.

The factoring of “how states evolve” from “how states emit” is the second clever move. Each can be learned, set, or modified independently. You can keep the transition model fixed (network topology of roads, language model of phonemes) and only learn the emission model from data.

Key Sources

hidden-markov-map-matching-noise-sparseness — application to GPS map matching

map-matching — primary application area in this wiki
viterbi — the dynamic programming algorithm for max-likelihood state recovery

ML Wiki

Explorer

Hidden Markov Models

The Problem

The Key Insight

Mechanism in Plain English

Math with Translation

Concrete Walkthrough

What’s Clever

Key Sources

Graph View

Table of Contents

Backlinks

ML Wiki

Explorer

Hidden Markov Models

The Problem

The Key Insight

Mechanism in Plain English

Math with Translation

Concrete Walkthrough

What’s Clever

Key Sources

Related Concepts

Graph View

Table of Contents

Backlinks