Memory Wall


The memory wall is the long-running phenomenon in which processor speed has improved far faster than memory bandwidth and latency, so that computer systems are increasingly limited by how fast they can move data rather than how fast they can compute. The term was coined by Wulf and McKee in "Hitting the Memory Wall: Implications of the Obvious," written in 1994 and published in 1995, which observed something uncomfortable: the gap is the difference between two diverging exponentials, so it does not merely persist; it compounds. Three decades later, the memory wall is arguably the single most important concept for anyone running AI workloads on their own hardware.

The diverging exponentials

From the mid-1980s into the early 2000s, single-processor performance improved on the order of 50–60% per year while DRAM latency improved only about 7% per year. Compounded over decades, the relative gap became enormous. Architects fought back with the cache hierarchy (small, fast memories stacked between the cores and DRAM) plus prefetching, out-of-order execution, and multithreading, all of which are elaborate machinery for hiding memory latency rather than eliminating it. Modern accelerators extend the same fight with high-bandwidth memory (HBM) stacked physically next to the compute die. Every one of these measures buys time; none repeals the underlying divergence. On memory-bound workloads, the arithmetic units of a modern GPU can consume operands far faster than the memory system can deliver them, so expensive silicon spends much of its life stalled, waiting for bytes.
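To make the compounding concrete, here is a minimal Python sketch that applies the growth rates quoted above. The 55% figure is simply the midpoint of the 50–60% range, and the function name is illustrative rather than taken from any source.

```python
def relative_gap(cpu_growth: float, mem_growth: float, years: int) -> float:
    """How far processor speed has pulled ahead of memory after `years`.

    Both arguments are annual improvement rates expressed as fractions
    (0.55 means 55% per year). The result is a ratio: 1.0 means no gap.
    """
    return ((1.0 + cpu_growth) / (1.0 + mem_growth)) ** years


if __name__ == "__main__":
    # Assumed rates: 55% per year for processors (midpoint of 50-60%),
    # 7% per year for DRAM latency.
    for years in (5, 10, 20):
        gap = relative_gap(0.55, 0.07, years)
        print(f"after {years:2d} years the relative gap is about {gap:,.0f}x")
```

Because the ratio of the two growth factors is itself raised to the number of years, the gap grows exponentially even though memory is also improving every year.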

Why the wall defines the AI era

Large-model inference is almost a worst-case workload for this imbalance. Generating each token requires streaming essentially all of the model's weights out of memory while performing comparatively little arithmetic per byte moved — a low-arithmetic-intensity, memory-bound pattern. The practical consequence is blunt: token-generation speed on a local machine is usually set by memory bandwidth and total VRAM, not by headline TFLOPS. A rough ceiling is simply bandwidth divided by model size — a 30 GB model on a 900 GB/s card cannot exceed about 30 tokens per second per stream no matter how many teraflops the spec sheet boasts. This is also why quantization helps so dramatically: shrinking weights from 16 bits to 4 bits cuts the bytes that must cross the wall per token by 4×, and the speedup follows almost linearly.
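The bandwidth-over-size ceiling can be sketched in a few lines of Python. This is a rough upper bound under stated assumptions: every weight is read exactly once per token, the weights sit entirely in the fast memory, and KV-cache and activation traffic are ignored. The 15-billion-parameter model is a hypothetical chosen only because it occupies 30 GB at 16 bits per weight.

```python
def weight_bytes(n_params: float, bits_per_weight: int) -> float:
    """Bytes needed to store the weights at a given precision."""
    return n_params * bits_per_weight / 8


def max_tokens_per_second(model_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Upper bound on single-stream decode speed.

    Assumes each generated token streams all weights from memory once,
    so the ceiling is simply bandwidth divided by model size.
    """
    return bandwidth_bytes_per_s / model_bytes


if __name__ == "__main__":
    bandwidth = 900e9   # 900 GB/s, the card from the example in the text
    n_params = 15e9     # hypothetical model: 15B parameters

    for bits in (16, 8, 4):
        size = weight_bytes(n_params, bits)
        ceiling = max_tokens_per_second(size, bandwidth)
        print(f"{bits:2d}-bit: {size / 1e9:5.1f} GB -> at most {ceiling:5.1f} tokens/s")
```

Real throughput lands below this ceiling, since achieved bandwidth is lower than the spec-sheet number and the KV cache adds traffic, but the ceiling itself scales with precision exactly as the paragraph describes.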

Living with the wall as a self-hoster

For the sovereign operator building a local AI box (the same instinct that puts a Bitcoin node on your own shelf instead of someone else's cloud), the memory wall should drive purchasing decisions. Memory bandwidth and capacity predict single-user inference speed better than compute ratings; unified-memory machines with wide buses can outperform nominally faster GPUs on large models that do not fit in those GPUs' VRAM; and the techniques that matter most (quantization, smaller KV caches, batching to raise arithmetic intensity) are all wall-management strategies. Interestingly, Bitcoin mining sits at the opposite extreme: SHA-256 hashing needs almost no memory traffic at all, which is why an ASIC can be nearly pure logic, and why some altcoin designers deliberately built memory-hard algorithms to blunt the ASIC advantage by forcing ASICs up against this very wall.
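Batching and quantization are both listed above as ways to manage the wall, and a back-of-the-envelope Python sketch shows why they raise arithmetic intensity. It assumes the common approximation of about 2 floating-point operations (one multiply, one add) per weight per generated token, and it counts only weight traffic, ignoring the KV cache and activations. Treat the numbers as illustrative.

```python
def decode_intensity(batch_size: int, bits_per_weight: int) -> float:
    """Approximate FLOPs performed per byte of weights read in one decode step.

    Assumption: each weight is read once per step and reused for every
    sequence in the batch, at roughly 2 FLOPs per weight per sequence.
    """
    flops_per_weight = 2.0 * batch_size
    bytes_per_weight = bits_per_weight / 8
    return flops_per_weight / bytes_per_weight


if __name__ == "__main__":
    for batch in (1, 8, 32):
        for bits in (16, 4):
            intensity = decode_intensity(batch, bits)
            print(f"batch={batch:2d}, {bits:2d}-bit weights: ~{intensity:6.1f} FLOP/byte")
```

Under these assumptions a single user at 16-bit precision gets about one operation per byte moved, which is why single-stream decoding is so firmly memory-bound. Serving several streams at once amortizes each weight read across the batch.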

The industry's newest answers to the wall are architectural: high-bandwidth memory stacked on silicon interposers millimeters from the compute die, giant on-chip SRAM pools, chiplet layouts that shorten wire runs, and unified-memory designs that eliminate copies between CPU and GPU entirely. Each buys bandwidth at the cost of capacity, price, or flexibility, and none changes the fundamental discipline: know your bytes-per-operation before you buy your FLOPS.

Reasoning about it precisely

The wall is not a vague complaint; it is quantifiable. See compute-bound vs memory-bound for how to classify a workload, and the roofline model for the standard picture that shows exactly where your hardware's memory ceiling ends and its compute ceiling begins.
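As a minimal illustration of the roofline idea, the Python sketch below computes attainable throughput as the smaller of the compute ceiling and bandwidth times arithmetic intensity. The hardware figures (100 TFLOP/s peak, 900 GB/s bandwidth) are hypothetical placeholders, not the specifications of any particular product.

```python
def attainable_flops(intensity: float, peak_flops: float, bandwidth: float) -> float:
    """Roofline bound: min(compute ceiling, memory ceiling).

    intensity  -- arithmetic intensity of the workload, in FLOP per byte
    peak_flops -- peak compute rate, in FLOP/s
    bandwidth  -- peak memory bandwidth, in bytes/s
    """
    return min(peak_flops, bandwidth * intensity)


def ridge_point(peak_flops: float, bandwidth: float) -> float:
    """Intensity (FLOP/byte) at which the two ceilings meet."""
    return peak_flops / bandwidth


def classify(intensity: float, peak_flops: float, bandwidth: float) -> str:
    """Label a workload by which ceiling limits it."""
    if intensity < ridge_point(peak_flops, bandwidth):
        return "memory-bound"
    return "compute-bound"


if __name__ == "__main__":
    PEAK = 100e12   # hypothetical: 100 TFLOP/s
    BW = 900e9      # hypothetical: 900 GB/s

    print(f"ridge point: {ridge_point(PEAK, BW):.1f} FLOP/byte")
    for intensity in (1.0, 32.0, 500.0):
        rate = attainable_flops(intensity, PEAK, BW)
        label = classify(intensity, PEAK, BW)
        print(f"{intensity:6.1f} FLOP/byte -> {rate / 1e12:6.2f} TFLOP/s ({label})")
```

A workload whose intensity sits far to the left of the ridge point cannot use more than a small fraction of the advertised compute, however large that number is.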

