t2vec: Deep Representation Learning for Trajectory Similarity Computation

Concepts: trajectory-embeddings | encoder-decoder | self-supervised-learning | contrastive-learning Builds on: sequence-to-sequence learning (Sutskever 2014) and word2vec (Mikolov 2013) — t2vec borrows the “encode a sequence into a fixed vector” recipe from NMT and the “you get embeddings by predicting context” instinct from word2vec Leads to: trajectory clustering, mobility-pattern mining, ETA prediction backbones, and the broader set of geospatial sequence-embedding methods

Comparing two GPS trajectories is computationally painful. Hausdorff distance, Frechet distance, DTW — all are O(n*m) on trajectory length and require full pairwise pixel comparisons. Worse, they break under noise: a trajectory sampled every 5 seconds and one sampled every 30 seconds along the same road look completely different to these metrics. t2vec (Li et al., ICDE 2018) reframes the problem: instead of computing distances between raw traces, learn an embedding function that maps any trajectory to a vector, and let cosine distance do the rest. The embedding is trained so that noisy/downsampled versions of the same trip land near each other.

The core idea

The analogy: Word2vec embeds words such that words appearing in similar contexts get similar vectors. t2vec embeds trajectories such that trajectories along the same path get similar vectors — even if one is sampled densely on a sunny day and the other is sparsely sampled in a tunnel. The model never sees clean ground truth; it only sees pairs of corrupted versions of the same trip and learns that the underlying path is what they share.

Three pieces:

Discretization. Tokenize each GPS point as a grid cell ID (the paper uses uniform-size square cells; modern systems use S2 or H3). A trajectory becomes a sequence of cell IDs, exactly like a sentence is a sequence of token IDs.
Sequence-to-sequence denoising autoencoder. A bidirectional LSTM encodes the (corrupted) input trajectory into a fixed-dimensional vector. A decoder LSTM reconstructs a clean version of the trajectory token by token. The fixed-dim hidden state at the encoder boundary is the trajectory embedding.
Self-supervised corruption. No labeled “similar trajectory” pairs are available, so the training signal comes from corruption: take a clean trajectory $T$ , generate two corrupted versions $T_{1}$ and $T_{2}$ by dropping points and adding spatial noise, train the model to encode $T_{1}$ to a representation that decodes to $T$ . The decoder forces the embedding to capture path identity, not surface noise.

Walkthrough

A worked example:

ORIGINAL TRIP (taxi from a bus station to the airport):
  Cell IDs: [c12, c13, c14, c20, c27, c33, c40, c41, c48]
  (sampled every 5 seconds along the road)

CORRUPTION 1 (dropping rate 0.6 -> keep ~40% of points):
  T1 = [c12, c14, c33, c41]

CORRUPTION 2 (drop 0.6 + spatial noise of one cell):
  T2 = [c13, c14, c34, c41, c48]   (c33 -> c34 = small lateral noise)

TRAINING:
  Encode T1 -> hidden vector h1 (256-dim)
  Decode h1 -> sequence prediction; loss = cross-entropy vs original T
  Same for T2.

  After training, embed any new trajectory T_new by encoding it.
  Compute similarity to query Q via cosine(embed(T_new), embed(Q)).

Compare to traditional metrics on a similarity task:

QUERY: a trip from station A to station B (clean reference path)
DATABASE: 1M GPS trajectories of varying quality

Goal: find the K trajectories most likely to be along the same path.

DTW / Frechet:    O(K * n * m)  - DTW on every pair, slow
                   Sensitive to noise + sampling rate.
                   Hours per query at K=1M.

t2vec:            O(n + log K)  - encode query once, ANN search
                   Robust to noise + dropout (it was trained on it).
                   Milliseconds per query.

The training data is just a corpus of GPS trajectories — no labels needed. The model learns from corruption alone what “the same trajectory” means.

What’s clever — find the instinct

The clever recognition: trajectory similarity is hard because the metric and the noise model are tangled together. DTW handles temporal misalignment but assumes both sequences are dense. Frechet handles spatial offset but breaks on dropout. Hausdorff is O(n*m) and barely robust. Each metric encodes a fixed assumption about the noise model.

t2vec’s instinct: let the embedding learn the noise model. Show the model many corruption patterns during training; the encoder will learn to map corrupted trajectories to the underlying path. At inference time, comparison is just cosine distance — fast, parallelizable, indexable.

“We propose to use a deep representation learning model to learn fixed-length vector representations of variable-length trajectories such that the original similarity is well preserved.”

The second clever move: the decoder is the regularizer. The encoder’s hidden state has to contain enough information for the decoder to reconstruct the clean trajectory token by token. A trivial encoder (e.g., one that outputs a constant) would fail this objective. The reconstruction loss forces the embedding to be informative about path identity.

The third clever move: leveraging spatial proximity to define corruption. The paper drops points uniformly and spatially perturbs cells (e.g., shift to neighboring cells). This corruption distribution roughly matches what real GPS noise looks like — so a model trained on it generalizes to real-world traces with little effort.

Does it work? What breaks?

Trajectory similarity benchmark (Porto taxi data, 1M trips):

Method	mean rank (lower is better)
Hausdorff	350.4
DTW	87.1
EDR	39.0
t2vec (256-dim)	3.3
t2vec (128-dim)	5.1

The mean rank measures: given a query trajectory and a corrupted version of it, where does the corrupted version rank in similarity score? Lower is better. t2vec is two orders of magnitude better than DTW.

Speed:

DTW (O(n*m), n=m=200): ~40,000 ops per pair x 1M pairs = 40e9 ops
                       ~30 seconds per query on a CPU.

t2vec encoding:        single forward pass through the LSTM.
                       ~5ms encode + ~1ms ANN lookup.
                       Total: ~6ms per query.

Speedup: 5000x.

What breaks:

Cold start. Need a corpus of trajectories from the target domain to pretrain. The paper uses Porto taxi data; transfer to a different city’s road network may need fine-tuning.
Path identity vs. user identity. t2vec learns paths, not users. Two trips by different drivers along the same road get the same embedding. For mobility prediction or user identification, you need to combine t2vec with user embeddings.
Cell granularity. The grid cell size is a critical hyperparameter. Too small and rare cells dominate vocabulary; too large and you lose spatial detail. S2 level 14-16 is the modern sweet spot.
Out-of-vocabulary cells. If a query trip enters a cell not in the training corpus, the encoder has no embedding. Handled by mapping to a special UNK token, which leaks signal.

So what?

t2vec is the canonical baseline for any “embed a trajectory” task. Modern variants — Trajectory Transformer, GeoSAN, MTrajRec — use transformers instead of LSTMs and self-supervised objectives beyond reconstruction (contrastive, masked-cell, masked-segment), but the core recipe is t2vec’s: tokenize as cells, train a sequence-to-vector model, query via ANN.

For Saikat’s work on map-matching, snap-to-road, and POI dedup, t2vec-style embeddings are the natural way to compare trajectories beyond pairwise DTW. Examples:

Snap-to-road validation: compare the embedding of a raw GPS trace to embeddings of candidate road-aligned versions; pick the closest.
POI dedup: when two POIs have similar trajectory-of-visit signatures (same set of users at similar times), they may be the same place. Embed the visit trajectories, cluster.
Address normalization: a route ending at “Jl Sudirman 23” and another at “Sudirman Street No 23” should embed to similar vectors if both are followed by the same final-mile pattern.

The deeper principle: whenever you have a domain-specific similarity that’s expensive to compute, ask whether you can replace it with a learned embedding plus cosine. This is what BERT did for sentence similarity, ColBERT for passage retrieval, t2vec for trajectories. The pattern is general.

“Our model can capture the underlying route information of a trajectory without being affected by sampling rates or noise.”

For the on-call engineer: if your trajectory-similarity workload is more than ~10K pairs per query, replace your DTW pipeline with t2vec or a transformer descendant. The training cost amortizes in days.

Connections

trajectory-embeddings — t2vec is the foundational paper
encoder-decoder — uses a seq2seq architecture with LSTM encoder and decoder
self-supervised-learning — corruption-and-reconstruct is a SSL pattern
contrastive-learning — adjacent in spirit; later trajectory work uses contrastive losses explicitly
hidden-markov-map-matching-noise-sparseness — HMM map matching solves a related problem (snap to road) at the per-trip level; t2vec works at the bulk-similarity level
map-matching — snap-to-road outputs are often the input to t2vec to clean up noise
tokenization — cell-ID discretization is a tokenization choice analogous to BPE for text

Citation

Li, X., Zhao, K., Cong, G., Jensen, C. S., & Wei, W. (2018). Deep Representation Learning for Trajectory Similarity Computation. ICDE 2018. https://ieeexplore.ieee.org/document/8509283

ML Wiki

Explorer