Definition
The memory wall is the long-running trend in which processor speed has improved far faster than memory bandwidth and latency, so that systems are increasingly limited by how fast they can move data rather than how fast they can compute. The term was coined by Wulf and McKee in 1994 to describe a gap that widens over time because processor and memory performance follow exponentials with different growth rates.
The diverging exponentials
Historically, processor performance improved on the order of 50–60% per year while DRAM latency improved only about 7% per year. Compounded over decades, this opened an enormous relative gap: modern accelerators can perform far more arithmetic per second than their memory systems can supply operands for, leaving expensive compute units idle while they wait on data.
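To make the compounding concrete, here is a minimal sketch that treats the growth rates quoted above as constant year over year. That is a simplification for illustration only, and the function name relative_gap is hypothetical rather than taken from any source.

```python
def relative_gap(cpu_rate: float, dram_rate: float, years: int) -> float:
    """Ratio of cumulative processor improvement to cumulative DRAM
    improvement after `years`, assuming both rates stay constant."""
    return ((1 + cpu_rate) / (1 + dram_rate)) ** years


# Rates from the text: 50-60% per year for processors, about 7% for DRAM latency.
for years in (10, 20):
    low = relative_gap(0.50, 0.07, years)
    high = relative_gap(0.60, 0.07, years)
    print(f"after {years} years: roughly {low:,.0f}x to {high:,.0f}x")
```

The point of the sketch is the shape of the curve, not the exact figures: each year multiplies the gap by the ratio of the two growth factors, so the gap itself grows exponentially.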
Why it defines the AI era
The memory wall is the central constraint of large-model inference. Generating each token streams billions of weight parameters out of memory while doing comparatively little arithmetic per byte, a pattern with low arithmetic intensity that is memory-bound. The practical consequence is that token-generation speed on a local machine is usually set by memory bandwidth, not by headline TFLOPS, while total VRAM determines whether the model fits in fast memory at all. It is also why techniques like quantization help so much: shrinking the weights cuts the bytes that must cross the wall.
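A rough way to see this is a back-of-envelope bound on decode speed. The sketch below assumes a dense model generating one token at a time, with every weight read from memory once per token and about two floating-point operations (a multiply and an add) per weight. It ignores the KV cache, activations, and software overhead, so real throughput will be lower. The function names, the 7-billion-parameter model, and the 400 GB/s bandwidth are all hypothetical values chosen for illustration.

```python
def model_bytes(n_params: float, bits_per_weight: float) -> float:
    """Bytes of weights that must be read to generate one token."""
    return n_params * bits_per_weight / 8


def max_tokens_per_second(n_params: float, bits_per_weight: float,
                          bandwidth_gb_s: float) -> float:
    """Bandwidth-only upper bound: bytes per second / bytes per token."""
    return bandwidth_gb_s * 1e9 / model_bytes(n_params, bits_per_weight)


def flops_per_byte(bits_per_weight: float) -> float:
    """Approximate arithmetic intensity, assuming ~2 FLOPs per weight read."""
    return 2 / (bits_per_weight / 8)


N_PARAMS = 7e9          # hypothetical 7-billion-parameter model
BANDWIDTH_GB_S = 400.0  # hypothetical memory bandwidth

for bits in (16, 8, 4):
    size_gb = model_bytes(N_PARAMS, bits) / 1e9
    bound = max_tokens_per_second(N_PARAMS, bits, BANDWIDTH_GB_S)
    print(f"{bits:>2}-bit: {size_gb:4.1f} GB of weights, "
          f"at most {bound:5.1f} tokens/s, "
          f"~{flops_per_byte(bits):.0f} FLOP per byte")
```

Under these assumptions the bound depends only on bandwidth and weight size, not on compute throughput, and halving the bits per weight doubles it. That is the sense in which quantization cuts the bytes that must cross the wall.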
For sovereign Bitcoiners self-hosting AI, respecting the memory wall is the difference between a fast and a frustrating local model. See compute-bound vs memory-bound and the roofline model.
