The Problem

Given a probabilistic sequence model (HMM, CRF, n-gram LM with constraints), find the most likely sequence of hidden states given the observations. The naive approach enumerates all possible sequences and picks the highest-probability one. For T time steps and N possible states per step, this is O(N^T) — exponential and intractable past T=20 or so.

The Key Insight

The Markov property — that the next state depends only on the current state — means the optimal path through state j at time t does not depend on which path got to j; it only depends on j’s score so far and the future. Therefore: at each (time, state) cell, store only the best score and a backpointer to its predecessor. Build forward; backtrack to recover.

This is dynamic programming applied to sequence decoding. Time complexity drops from O(N^T) to O(T * N^2): linear in length, quadratic in state count.

Mechanism in Plain English

  1. Initialize: for each possible starting state j, compute delta_1(j) = pi(j) * B(j, o_1).
  2. Recurse: for each time t = 2, …, T and each state j:
    • Compute the score of arriving at j at time t via every possible predecessor i.
    • Keep the maximum: delta_t(j) = max_i [delta_{t-1}(i) * A(i, j)] * B(j, o_t).
    • Save argmax_i as a backpointer.
  3. Termination: at t = T, find the state with max delta_T(j).
  4. Backtrack: follow backpointers from that state back to t = 1 to recover the best sequence.

Math with Translation

Forward recursion:

  • delta_t(j) — best score ending in state j at time t
  • A(i, j) — transition probability from state i to j
  • B(j, o_t) — emission probability of o_t from state j

The argmax_i is the predecessor backpointer:

In log space (numerical stability for long sequences):

This converts products into sums and avoids underflow at long sequence lengths.

Concrete Walkthrough

See hidden-markov-models for a worked weather-model example.

For map matching specifically: states are road segments, observations are GPS points, A is the transition model based on |d_GPS - d_road|, B is the emission model (Gaussian in perpendicular distance to road). The Viterbi recursion runs over a trellis of (time, candidate-segment) cells, with pruning to keep only segments within a radius of each GPS observation.

What’s Clever

The recursion exploits the optimal substructure of the Markov assumption. Without that assumption, the best path through state j at time t depends on the entire history, not just the current state’s score — and dynamic programming would not apply.

The backpointer trick lets you recover the actual sequence after computing only scores. Without it, you would need to store entire paths at each cell, blowing up memory.

Code

import numpy as np
 
def viterbi(obs, A, B, pi):
    T = len(obs)
    N = A.shape[0]
    delta = np.zeros((T, N))
    psi = np.zeros((T, N), dtype=int)
 
    delta[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        for j in range(N):
            scores = delta[t-1] * A[:, j]
            psi[t, j] = np.argmax(scores)
            delta[t, j] = scores[psi[t, j]] * B[j, obs[t]]
 
    path = [np.argmax(delta[-1])]
    for t in range(T-1, 0, -1):
        path.append(psi[t, path[-1]])
    return path[::-1]

Key Sources