Definition
Self-attention is the core operation of the Transformer. For every token in a sequence, the model produces three learned vectors: a query, a key, and a value. The attention weight between two tokens is the dot product of one token's query with another's key, scaled and passed through a softmax so the weights sum to one. Each token's new representation is then the weighted sum of all value vectors. In plain terms, every token gets to look at every other token and decide how much each one matters to its own meaning.
Scaled dot-product and multiple heads
The dot products are divided by the square root of the key dimension before the softmax; this scaling prevents the values from growing so large that gradients vanish. Transformers run several attention operations in parallel as separate heads, each free to focus on a different relationship, such as syntax in one head and long-range reference in another. The heads' outputs are concatenated and projected back to the model dimension. Memory-saving variants such as grouped-query attention let several query heads share key and value projections.
The cost that shapes deployment
Standard self-attention compares every token with every other token, so compute and memory scale with the square of the sequence length. This quadratic cost is the main reason long-context models are expensive to run and why the key-value cache dominates VRAM during inference, a constraint that matters when you self-host on hardware you own.
See also positional encoding, which gives attention its sense of token order, and layer normalization.
In Simple Terms
Self-attention is the core operation of the Transformer. For every token in a sequence, the model produces three learned vectors: a query, a key, and…
