Definition
Self-attention treats a sequence as an unordered set, so a Transformer needs an explicit signal telling it where each token sits. That signal is the positional encoding. The original Transformer added fixed sinusoidal vectors to the token embeddings. Most current LLMs instead use Rotary Position Embedding (RoPE), introduced in the 2021 RoFormer paper by Su and colleagues, because it encodes relative position cleanly and extrapolates better to longer sequences.
How RoPE works
RoPE does not add a position vector. Instead it rotates each query and key vector by an angle proportional to the token's position, applying a different rotation frequency to each pair of dimensions. When two rotated vectors are compared in self-attention, the dot product depends only on the difference between their positions, not their absolute positions. This gives the model a built-in sense of relative distance and a smooth decay of attention as tokens grow further apart.
Why it matters for self-hosting
RoPE's rotation frequencies can be rescaled after training to stretch a model's usable context window without full retraining, the basis of techniques like NTK and YaRN scaling. If you run a local model and want longer context than it shipped with, these RoPE adjustments are often what makes it possible on the VRAM you have.
Related entries: self-attention and grouped-query attention.
In Simple Terms
Self-attention treats a sequence as an unordered set, so a Transformer needs an explicit signal telling it where each token sits. That signal is the…
