Self-Attention

Self-attention is the core operation of the Transformer. For every token in a sequence, the model produces three learned vectors: a query, a key, and a value. The attention weight between two tokens is the dot product of one token's query with another's key, scaled and passed through a softmax so the weights sum to one. Each token's new representation is then the weighted sum of all value vectors. In plain terms, every token gets to look at every other token and decide how much each one matters to its own meaning.

Scaled dot-product and multiple heads

The dot products are divided by the square root of the key dimension before the softmax. Without this scaling, the dot products grow with the dimension, the softmax saturates toward a single token, and the gradients flowing back through it become vanishingly small. Transformers run several attention operations in parallel as separate heads, each free to focus on a different relationship, such as syntax in one head and long-range reference in another. The heads' outputs are concatenated and projected back to the model dimension. Memory-saving variants such as grouped-query attention let several query heads share one set of key and value projections.
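As a minimal sketch of the head split, scaling, concatenation, and output projection described above, the following NumPy code may help. The function name, the weight names (Wq, Wk, Wv, Wo), and the toy sizes are assumptions made for illustration, and the weights are random rather than learned.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtracting the max keeps exp() from overflowing; the result is unchanged.
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def multi_head_attention(X, Wq, Wk, Wv, Wo, num_heads):
    """X: (n, d_model). Wq, Wk, Wv, Wo: (d_model, d_model)."""
    n, d_model = X.shape
    d_head = d_model // num_heads

    def split_heads(M):
        # (n, d_model) -> (num_heads, n, d_head)
        return M.reshape(n, num_heads, d_head).transpose(1, 0, 2)

    Q = split_heads(X @ Wq)
    K = split_heads(X @ Wk)
    V = split_heads(X @ Wv)

    # Scale by the square root of the per-head key dimension before the softmax.
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)  # (num_heads, n, n)
    weights = softmax(scores, axis=-1)
    heads = weights @ V                                   # (num_heads, n, d_head)

    # Concatenate the heads, then project back to the model dimension.
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)
    return concat @ Wo, weights

# Toy sizes, chosen only for illustration.
rng = np.random.default_rng(0)
n, d_model, num_heads = 4, 8, 2
X = rng.normal(size=(n, d_model))
Wq, Wk, Wv, Wo = (rng.normal(size=(d_model, d_model)) for _ in range(4))
out, weights = multi_head_attention(X, Wq, Wk, Wv, Wo, num_heads)
```

Each head produces its own n-by-n weight table, so the two heads here are free to distribute attention differently over the same four tokens.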

The cost that shapes deployment

Standard self-attention compares every token with every other token, so compute and the size of the attention score table scale with the square of the sequence length. This quadratic cost is the main reason long-context models are expensive to run. Inference adds a second cost: the key-value cache, which grows linearly with context length but must be held for every layer, and which often dominates VRAM at long contexts. That constraint matters when you self-host on hardware you own.
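A rough back-of-the-envelope calculation can make these costs concrete. The sketch below assumes 2-byte (16-bit) elements and a hypothetical model configuration (32 layers, 32 key-value heads, head dimension 128); the function names are illustrative, and real runtimes add overhead that is not modeled here.

```python
def attention_scores_bytes(seq_len, num_heads, bytes_per_elem=2):
    # One seq_len x seq_len score table per head, for a single layer,
    # if the table is fully materialized in memory.
    return num_heads * seq_len * seq_len * bytes_per_elem

def kv_cache_bytes(seq_len, num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    # Factor of 2: one cached tensor for keys and one for values.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem

# Hypothetical configuration, not taken from any specific model.
NUM_LAYERS, NUM_KV_HEADS, HEAD_DIM, NUM_HEADS = 32, 32, 128, 32

for seq_len in (4096, 8192):
    kv = kv_cache_bytes(seq_len, NUM_LAYERS, NUM_KV_HEADS, HEAD_DIM)
    scores = attention_scores_bytes(seq_len, NUM_HEADS)
    print(f"n={seq_len}: KV cache {kv / 2**30:.1f} GiB, "
          f"score table per layer {scores / 2**30:.1f} GiB")
```

Doubling the sequence length doubles the cache but quadruples the score table, which is why the two costs are usually attacked with different techniques.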

See also positional encoding, which gives attention its sense of token order, and layer normalization.

A Worked Walk-Through

Concretely, the input to a layer is a matrix X with one row per token. Three learned weight matrices project it into queries, keys, and values: Q = X·Wq, K = X·Wk, V = X·Wv. The layer then computes softmax(Q·Kᵀ / √d)·V in one shot, where d is the key dimension: the Q-times-K-transpose product is a table of every token's affinity for every other token, the softmax turns each row into weights that sum to one, and multiplying by V blends each token's output from the tokens it attends to. Nothing in the mechanism is sequential: the entire table is one batch of matrix multiplications, which is precisely what GPUs are built to do fast.
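The walk-through above translates almost line for line into NumPy. This is a single-head sketch with random stand-ins for the learned matrices; the function name and the toy shapes (5 tokens, model width 16, key width 8) are assumptions for illustration.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Single-head attention: softmax(Q K^T / sqrt(d)) V."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d = K.shape[-1]                      # key dimension
    scores = Q @ K.T / np.sqrt(d)        # (n, n) affinity table
    # Row-wise softmax, shifted by the row max for numerical stability.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights          # blended values, attention weights

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 16))             # 5 tokens, one row each
Wq, Wk, Wv = (rng.normal(size=(16, 8)) for _ in range(3))
out, weights = self_attention(X, Wq, Wk, Wv)
```

Row i of weights holds token i's attention over all five tokens, and row i of out is the corresponding weighted sum of value vectors.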

Causal Masking: Why Generators Cannot Peek

Decoder-style LLMs add one constraint: a token may attend only to itself and tokens before it. This causal mask sets every future position's attention score to negative infinity before the softmax, zeroing its weight. The mask is what makes autoregressive generation coherent — the model's prediction for position n provably depends only on positions 1 through n−1 — and it is also what makes the KV cache work: past tokens' keys and values never change as generation proceeds, so they can be computed once and reused for every subsequent token instead of recomputed.
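The effect of the mask can be checked directly. The sketch below, which uses an illustrative function name and random weights, applies a causal mask and then changes the last token to confirm that the outputs for all earlier positions stay the same.

```python
import numpy as np

def causal_self_attention(X, Wq, Wk, Wv):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    n, d = K.shape
    scores = Q @ K.T / np.sqrt(d)
    # True strictly above the diagonal marks future positions.
    future = np.triu(np.ones((n, n), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    # The diagonal is never masked, so every row max is finite.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)             # exp(-inf) is exactly 0
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 8))
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out, weights = causal_self_attention(X, Wq, Wk, Wv)

# Replace the final token and recompute: earlier outputs should not move.
X_changed = X.copy()
X_changed[-1] = rng.normal(size=8)
out_changed, _ = causal_self_attention(X_changed, Wq, Wk, Wv)
prefix_unchanged = np.allclose(out[:-1], out_changed[:-1])
```

The same property is what lets a runtime keep the keys and values of earlier tokens in a cache: nothing that arrives later can alter them.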

Taming the Quadratic Cost

The n-squared attention table is the architecture's bill, and years of engineering have attacked it from different angles. Exact-computation optimizations reorganize the arithmetic to avoid ever materializing the full attention matrix in slow GPU memory, delivering large speedups with mathematically identical results. Architectural variants change the trade-off itself: multi-query and grouped-query attention shrink the number of key-value heads to cut KV cache memory, and sliding-window schemes restrict each token to a local neighborhood, trading unbounded range for linear cost. For a self-hoster these are not abstractions — they are the difference between a context window that fits on your GPU and one that does not, and they explain why two models of identical parameter count can have wildly different memory appetites. Check a model's attention configuration, not just its size, before assuming it fits your hardware.
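To illustrate the sliding-window trade-off, the sketch below builds a causal mask in which each token sees only itself and a fixed number of preceding tokens, then counts how many token pairs are actually scored. The function name and the window convention (the window size includes the current token) are assumptions for illustration; real models define their windows in their own configuration.

```python
import numpy as np

def sliding_window_mask(n, window):
    """True where token i may attend to token j: causal, and at most
    `window` positions back, counting the token itself."""
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    return (j <= i) & (j > i - window)

n, window = 8, 3
mask = sliding_window_mask(n, window)

full_causal_pairs = n * (n + 1) // 2     # grows with n squared
window_pairs = int(mask.sum())           # grows roughly as n * window

print(mask.astype(int))
print(full_causal_pairs, window_pairs)
```

With a fixed window the number of scored pairs grows in proportion to the sequence length, at the price that a token can no longer attend directly to anything outside its window.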
