The Problem

GPS trajectories are messy. Same trip, two devices, two phone-models, two seconds of clock skew — and the resulting (lat, lon, t) sequences look completely different to most distance metrics. Hausdorff and Frechet distances are brittle to noise. DTW is robust to time warping but breaks under dropout. All three are O(n*m) per pair — at 1M trajectories you cannot pairwise compare them all. The general problem: how to compare trajectories at production scale, robustly to noise, sampling rate, and missing data.

The Key Insight

Treat each trajectory as a sentence whose tokens are spatial cells (S2, H3, or uniform grid). Train a neural network to map the (corrupted) trajectory sequence to a fixed-dimensional vector. The training signal: corrupt clean trajectories in different ways and force the model to map them all to the same vector. After training, similar trajectories — even if noisy or downsampled differently — end up in nearby points of vector space. Comparison becomes O(d) cosine instead of O(n*m) DTW.

Mechanism in Plain English

  1. Discretize: tokenize each GPS point as a cell ID (S2 level 14-16, H3 resolution 8-10, or uniform grid). A trajectory becomes a sequence of cell IDs, like a sentence is a sequence of word IDs.
  2. Encoder: a sequence model (LSTM, GRU, transformer) processes the cell ID sequence and outputs a fixed-dimensional vector.
  3. Training signal: from a clean trajectory , generate two corrupted versions by dropping points and adding spatial noise. Train so that and are close. Use either reconstruction (encoder-decoder, t2vec-style) or contrastive (modern variants).
  4. Inference: encode any new trajectory once. Compare via cosine. Index with FAISS for fast nearest-neighbor lookup.

ASCII Diagram

RAW GPS TRAJECTORY:
  [(lat1, lon1, t1), (lat2, lon2, t2), ..., (latN, lonN, tN)]

  |
  v  S2 / H3 cell tokenization

CELL ID SEQUENCE:
  [c12, c13, c14, c20, c27, c33, c40]

  |
  v  Sequence encoder (LSTM, transformer)

EMBEDDING (256-dim):
  [0.42, -0.18, 0.93, ..., 0.07]

  |
  v  Compare via cosine to indexed trajectory database

NEAREST TRAJECTORIES (similar paths):
  [trip-43821 (cosine 0.98), trip-12993 (0.95), ...]

Math with Translation

Reconstruction-based (t2vec-style):

The encoder takes a (corrupted) sequence and produces a vector . The decoder takes and tries to reconstruct the clean sequence . Loss = cross-entropy on the reconstructed cell IDs.

  • The encoder must compress enough to reconstruct , so ends up encoding the underlying clean path.

Contrastive-based (modern variants):

Generate two augmentations of the same trajectory . Train so is closer to than to embeddings of other trajectories.

Same InfoNCE form as in sentence embeddings. Augmentation strategies: dropout, spatial noise, sub-trajectory sampling, time-warping.

Concrete Walkthrough

A "TAXI TRIP FROM AIRPORT TO HOTEL DISTRICT" example:

CLEAN TRAJECTORY (sampled every 5s):
  cells = [c100, c101, c108, c116, c124, c133, c142, c150]

CORRUPTION 1 (drop 60%):
  T1 = [c100, c108, c133, c150]

CORRUPTION 2 (drop 50% + spatial noise):
  T2 = [c100, c109, c116, c134, c142]   (c108 -> c109 = neighbor cell)

ENCODE T1 -> h1 (256-d vector)
ENCODE T2 -> h2 (256-d vector)

TRAINING:
  Pull h1 and h2 together (they came from the same trip).
  Push h1 and h_other apart (other trips have different paths).

INFERENCE QUERY:
  New raw trajectory T_query (also from airport-to-hotel area).
  Compute h_query.
  Search index of 1M historical trip embeddings.
  Top match: a previous airport-to-hotel trip.
  
  Cosine: ~0.95 with similar trips. ~0.30 with unrelated trips.

What’s Clever

The first clever recognition: trajectory similarity is hard because the noise model and the metric are tangled. DTW handles temporal misalignment but not dropout. Frechet handles spatial offset but not sampling rate. Each metric encodes a fixed assumption. By learning the embedding from corruption-augmented data, the model learns the noise model implicitly — and the metric (cosine) is decoupled from it.

The second clever move: discretization to cells gives a finite vocabulary. Continuous (lat, lon) is a non-stationary distribution; LSTMs/transformers struggle with it. Discretizing to S2/H3 cells turns the problem into discrete sequence modeling, where decades of NLP techniques apply. The cell IDs are the trajectory’s “tokens.”

The third recognition: the same recipe works at multiple scales. With small cells (S2 level 18, ~80m), embeddings capture detailed routes. With larger cells (S2 level 14, ~1.2km), they capture coarse patterns (commute origin/destination). You can train one model per scale or use multi-resolution cell embeddings.

Key Sources

Open Questions

  • Cell granularity: too small → vocabulary explosion (sparse training, OOV problems); too large → loss of spatial detail. S2 level 14-16 is the modern sweet spot but task-dependent.
  • Multi-modal trajectories: how to embed trips combining walking, driving, public transit? Each mode has different speed and pattern statistics.
  • Time-aware embedding: t2vec ignores wall-clock time; a 9am commute and a 9pm commute along the same path have similar embeddings. Adding time features helps for some applications, hurts for others.
  • Long trajectories: trips over 1000 cells stress LSTM memory; transformers help but cost more. Hierarchical encoding (segment → trip) is one direction.
  • Cross-city transfer: a model trained on Jakarta trajectories may not transfer to Manila. How to share knowledge across cities while preserving local geometry?