Transformer

The Transformer is a neural network architecture introduced in the 2017 paper "Attention Is All You Need" by Vaswani and colleagues at Google. It dispenses with the recurrence and convolutions used by earlier sequence models and relies entirely on self-attention to relate every token to every other token. Because attention can be computed in parallel across a whole sequence, Transformers train far faster on modern hardware than the recurrent networks they replaced, which is why they underpin nearly every large language model a sovereign operator might run locally.

How a Transformer is built

The original design is an encoder-decoder stack, though most generative LLMs use a decoder-only variant. Each layer of that variant contains two sub-blocks: a multi-head self-attention block and a position-wise feed-forward network. (Decoder layers in the original encoder-decoder design add a third sub-block that attends to the encoder's output.) Every sub-block is wrapped in a residual connection and paired with a layer normalization step, which together keep gradients stable as depth grows. Because attention itself is order-agnostic, the model needs a positional encoding to know where each token sits in the sequence.
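To make the positional encoding idea concrete, here is a minimal sketch of the fixed sinusoidal scheme described in the original paper. It is one of several schemes in use (many current open-weight models use other approaches, such as rotary embeddings), and the function name and the toy sizes are illustrative assumptions rather than anything taken from a particular library.

```python
import math

def sinusoidal_positional_encoding(seq_len, d_model):
    """Return a seq_len x d_model table of fixed position vectors.

    Illustrative sketch of the sinusoidal scheme; assumes d_model is even.
    """
    if d_model % 2 != 0:
        raise ValueError("d_model must be even for this sketch")
    table = []
    for pos in range(seq_len):
        row = []
        for i in range(0, d_model, 2):
            # Each pair of dimensions shares one wavelength; later pairs
            # oscillate more slowly.
            angle = pos / (10000 ** (i / d_model))
            row.append(math.sin(angle))  # even dimension
            row.append(math.cos(angle))  # odd dimension
        table.append(row)
    return table

# Toy sizes chosen only for illustration.
pe = sinusoidal_positional_encoding(seq_len=4, d_model=8)
print(pe[0])  # position 0 alternates sin(0) = 0.0 and cos(0) = 1.0
```

Each row is added to the matching token embedding before the first layer, so two identical tokens at different positions enter the stack with different vectors.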

What attention actually computes

Inside the attention block, each token's representation is projected into three vectors: a query, a key, and a value. The query of the token being processed is compared, by dot product, against the key of every token it is allowed to see (in a decoder-only model, itself and the tokens before it). The resulting scores are scaled and normalized into weights with a softmax, and those weights decide how much of each token's value vector flows into the output. In plain terms, every token asks "which other tokens matter to me right now?" and blends their information accordingly. Multi-head attention runs this process several times in parallel with different learned projections, so that one head might track syntax while another tracks long-range references, and then concatenates the results. The feed-forward network that follows is where much of the model's factual knowledge is thought to live; it typically holds the majority of the parameters in each layer.
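The query, key, and value computation can be sketched in a few lines. The following is a single-head, dependency-free illustration of scaled dot-product attention. It assumes the query, key, and value vectors have already been produced by the learned projections, and the `causal` flag and function names are hypothetical conveniences for this example. Real implementations use batched matrix multiplications on a GPU rather than Python loops.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values, causal=False):
    """Single-head scaled dot-product attention over lists of vectors."""
    d_k = len(keys[0])
    outputs = []
    for i, q in enumerate(queries):
        # In causal mode, token i may only look at tokens 0..i.
        visible = range(i + 1) if causal else range(len(keys))
        scores = [
            sum(a * b for a, b in zip(q, keys[j])) / math.sqrt(d_k)
            for j in visible
        ]
        weights = softmax(scores)
        # Blend the visible value vectors according to the weights.
        out = [
            sum(w * values[j][c] for w, j in zip(weights, visible))
            for c in range(len(values[0]))
        ]
        outputs.append(out)
    return outputs

# Toy example: with all-zero keys every score ties, so each token
# receives the plain average of the value vectors.
q = [[1.0, 0.0], [0.0, 1.0]]
k = [[0.0, 0.0], [0.0, 0.0]]
v = [[1.0, 2.0], [3.0, 4.0]]
print(attention(q, k, v))  # [[2.0, 3.0], [2.0, 3.0]]
```

Multi-head attention would run this routine once per head on separately projected inputs and concatenate the per-head outputs.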

Decoder-only generation and its costs

A decoder-only LLM generates text autoregressively: it attends over everything written so far, predicts one token, appends it, and repeats. Two practical consequences follow. First, attention cost grows quadratically with sequence length during the initial prompt processing, which is why very long context windows are expensive. Second, generation would be crushingly redundant without the KV cache, which stores each token's computed keys and values so they are never recomputed. With the cache, each new token costs work proportional to the current context length, but the cache itself grows with every token and occupies memory alongside the weights. These two facts, quadratic prompt cost and linear but memory-hungry generation, govern almost every performance number you see when running models on your own hardware.
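Both costs can be estimated with simple arithmetic. The sketch below counts the attention scores computed while processing a prompt and the bytes a KV cache occupies. The function names and the example model shape (32 layers, heads of size 128, 16-bit cache entries, an 8192-token context) are hypothetical values chosen for illustration, not the specification of any real model.

```python
def prefill_score_entries(seq_len, n_layers, n_heads):
    """Query-key scores computed over a full prompt (no causal savings counted)."""
    return n_layers * n_heads * seq_len ** 2

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   bytes_per_value=2, batch_size=1):
    """Memory needed to cache keys and values for seq_len tokens."""
    # The leading 2 covers one key vector plus one value vector
    # per token, per layer, per KV head.
    return (2 * n_layers * n_kv_heads * head_dim
            * seq_len * bytes_per_value * batch_size)

# Doubling the prompt length quadruples the number of attention scores.
short = prefill_score_entries(2048, n_layers=32, n_heads=32)
long = prefill_score_entries(4096, n_layers=32, n_heads=32)
print(long // short)  # 4

# Hypothetical model with 32 KV heads versus a grouped-query
# variant that shares 8 KV heads among the same query heads.
full = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128, seq_len=8192)
grouped = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, seq_len=8192)
print(full / 2**30, "GiB")     # 4.0 GiB
print(grouped / 2**30, "GiB")  # 1.0 GiB
```

The cache scales linearly with context length, so the same hypothetical model at four times the context would need four times these figures.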

Why it matters for sovereignty

Understanding the Transformer is the entry point to running models you control rather than renting them. Architectural choices like grouped-query attention shrink the KV cache and so directly affect how much VRAM a model needs, which decides whether a given model fits on hardware you own. The architecture's parallelism is also why consumer GPUs, designed to do the same operation across thousands of pixels, turned out to be nearly ideal LLM engines: a Transformer's core workload is huge matrix multiplications, exactly what that silicon does best. When you read a model card, the layer count, head configuration, and hidden size are not trivia; they are the specification that tells you whether the model fits your machine and how fast it will run.

One further property helps explain the pace of AI progress since 2017: Transformers scale remarkably smoothly. Making them deeper and wider, and feeding them more data and compute, has improved capability in a predictable way that earlier architectures never showed. That is why the same basic design spans everything from a model on a phone to frontier systems, differing mainly in size rather than in kind. For a self-hoster, that continuity is good news: the concepts here apply unchanged to every open-weight model you will ever download.
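As a rough illustration of reading a model card this way, the sketch below estimates parameter count and weight memory from layer count and hidden size. It assumes the classic layout (four attention projection matrices and a two-matrix feed-forward network four times the hidden size) and ignores biases, normalization weights, and any separate output matrix. Many current models deviate from this, for example with gated feed-forward blocks or grouped-query attention, so treat the result as a ballpark. All names and the example shape are hypothetical.

```python
def estimate_layer_params(d_model, d_ff=None):
    """Return (attention_params, ffn_params) for one classic layer."""
    d_ff = 4 * d_model if d_ff is None else d_ff
    attn = 4 * d_model * d_model  # query, key, value, and output projections
    ffn = 2 * d_model * d_ff      # up-projection and down-projection
    return attn, ffn

def estimate_weight_bytes(n_layers, d_model, vocab_size,
                          bytes_per_param=2, d_ff=None):
    """Return (total_params, total_bytes) under the assumptions above."""
    attn, ffn = estimate_layer_params(d_model, d_ff)
    embed = vocab_size * d_model  # token embedding table
    total_params = n_layers * (attn + ffn) + embed
    return total_params, total_params * bytes_per_param

# Hypothetical model card: 32 layers, hidden size 4096, 32000-token vocabulary.
attn, ffn = estimate_layer_params(4096)
print(f"FFN share of layer parameters: {ffn / (attn + ffn):.2f}")

params, nbytes = estimate_weight_bytes(32, 4096, 32000)
print(f"{params / 1e9:.2f}B parameters, {nbytes / 1e9:.1f} GB at 16 bits each")
```

Lowering `bytes_per_param` (for instance to 0.5 for 4-bit quantized weights) shows why quantization decides whether a model fits on a given card; KV cache and activation memory come on top of this figure.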

For practical deployment of these models on your own hardware, see our work on self-hosted inference, and explore related entries such as self-attention and backpropagation to understand how Transformers learn.
