How Transformers Learn Order: Absolute, Relative, and Rotary Positions
Transformers process tokens in parallel and have no built‑in sense of order. Positional encodings inject information about where each token appears in the sequence. The basic idea: each position gets a vector, and that vector is added to the token embedding before entering the transformer. This lets the model distinguish sequences like “dog bites man” from “man bites dog” and learn how order affects meaning. Over time, several approaches to positional encoding have emerged, ranging from fixed sinusoidal schemes to fully learned embeddings, relative encodings, and rotary methods, each with different tradeoffs in flexibility, inductive bias, and length generalization.
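To make the "add a vector per position" idea concrete, here is a minimal sketch of the fixed sinusoidal scheme from the original Transformer paper, written in NumPy. The function name `sinusoidal_positions` and the toy shapes are illustrative, not from any particular library; real implementations typically operate on batched tensors inside the model's embedding layer.

```python
import numpy as np

def sinusoidal_positions(seq_len: int, d_model: int) -> np.ndarray:
    """Fixed sinusoidal position vectors, one row per position (assumes even d_model)."""
    positions = np.arange(seq_len)[:, None]              # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]              # (1, d_model/2)
    angles = positions / (10000 ** (dims / d_model))      # one frequency per dimension pair
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                           # even dimensions: sine
    pe[:, 1::2] = np.cos(angles)                           # odd dimensions: cosine
    return pe

# Toy usage: position vectors are added to token embeddings before the first layer.
seq_len, d_model = 8, 16
token_embeddings = np.random.randn(seq_len, d_model)       # stand-in for learned token embeddings
inputs = token_embeddings + sinusoidal_positions(seq_len, d_model)
```

Because the sinusoids are a fixed function of position rather than learned parameters, the same formula can be evaluated at positions never seen during training, which is one reason this scheme is often discussed in the context of length generalization.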