Definition
Every workload running on an accelerator is limited by one of two resources at any moment: the speed of its arithmetic units or the speed of its memory system. A workload is compute-bound when it is limited by how fast the processor can do math, and memory-bound when it is limited by how fast data can be fetched from memory. Knowing which regime you are in tells you what to optimize first, and which hardware upgrade will actually help.
The deciding factor
The dividing line is arithmetic intensity (floating-point operations performed per byte moved to or from memory) compared against the hardware's ridge point: its peak FLOP/s divided by its peak memory bandwidth in bytes/s, which is also a figure in FLOPs per byte. If a kernel's intensity is above that ratio, the chip can keep its math units fed and the workload is compute-bound; below it, the math units starve waiting on memory and the workload is memory-bound. This is the formal logic behind the roofline model.
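The comparison above can be sketched in a few lines of Python. This is a minimal illustration, not a profiling tool: the function names are made up for this example, and the hardware figures (100 TFLOP/s and 1 TB/s) are hypothetical round numbers rather than the specification of any real chip. An intensity exactly at the ridge point sits on both limits at once; the sketch arbitrarily labels that case memory-bound.

```python
def ridge_point(peak_flops: float, peak_bandwidth: float) -> float:
    """Ridge point in FLOPs per byte: peak FLOP/s divided by peak bytes/s."""
    return peak_flops / peak_bandwidth


def attainable_flops(intensity: float, peak_flops: float, peak_bandwidth: float) -> float:
    """Roofline bound: the lower of the compute roof and the memory roof."""
    return min(peak_flops, intensity * peak_bandwidth)


def classify(intensity: float, peak_flops: float, peak_bandwidth: float) -> str:
    """Label a kernel by comparing its intensity with the ridge point."""
    if intensity > ridge_point(peak_flops, peak_bandwidth):
        return "compute-bound"
    return "memory-bound"  # includes the tie case (arbitrary choice)


# Hypothetical accelerator, chosen only to give round numbers.
PEAK_FLOPS = 100e12      # 100 TFLOP/s
PEAK_BANDWIDTH = 1e12    # 1 TB/s

print("ridge point:", ridge_point(PEAK_FLOPS, PEAK_BANDWIDTH), "FLOP/byte")
for intensity in (2.0, 300.0):
    regime = classify(intensity, PEAK_FLOPS, PEAK_BANDWIDTH)
    bound = attainable_flops(intensity, PEAK_FLOPS, PEAK_BANDWIDTH)
    print(f"{intensity:6.1f} FLOP/byte -> {regime}, at most {bound / 1e12:.0f} TFLOP/s")
```

With these assumed figures the ridge point is 100 FLOPs per byte, so a kernel at 2 FLOPs per byte can reach at most 2% of peak math throughput no matter how many arithmetic units the chip has.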
What it means for AI in practice
Training large models on big batches is typically compute-bound: huge GEMMs reuse each loaded value many times, so faster math units or lower-precision formats give the win. Single-stream inference, the usual case for a local, self-hosted model, is usually memory-bound: generating each token reads the full weight set from memory once while performing only about two floating-point operations per weight, so memory bandwidth and capacity dominate, and extra TFLOPS sit idle. The right diagnosis prevents buying the wrong hardware.
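A rough calculation shows why the two cases land on opposite sides of the ridge point. The sketch below is an idealized estimate under stated assumptions: each matrix is moved to or from memory exactly once, elements are 2 bytes wide (as in FP16 or BF16), and a multiply-add counts as 2 FLOPs. The function names, matrix sizes, model size, and bandwidth are all hypothetical values picked for illustration.

```python
def gemm_intensity(m: int, n: int, k: int, bytes_per_element: int = 2) -> float:
    """Idealized FLOPs per byte for an (m x k) @ (k x n) matrix multiply.

    Assumes each of the two inputs and the output crosses the memory
    bus exactly once, which real kernels only approximate.
    """
    flops = 2 * m * n * k  # one multiply and one add per term
    bytes_moved = bytes_per_element * (m * k + k * n + m * n)
    return flops / bytes_moved


def decode_tokens_per_second_bound(model_bytes: float, bandwidth: float) -> float:
    """Upper bound on single-stream decode speed if every token
    must read all weights from memory once."""
    return bandwidth / model_bytes


# Large-batch case: 4096 rows of activations against a 4096 x 4096 weight matrix.
print("batched GEMM:", round(gemm_intensity(4096, 4096, 4096)), "FLOP/byte")

# Single-stream decode: one activation row against the same weight matrix.
print("batch-1 decode:", round(gemm_intensity(1, 4096, 4096), 3), "FLOP/byte")

# Hypothetical 7-billion-parameter model at 2 bytes per weight on 1 TB/s of bandwidth.
bound = decode_tokens_per_second_bound(model_bytes=7e9 * 2, bandwidth=1e12)
print(f"decode upper bound: about {bound:.0f} tokens/s")
```

Under these assumptions the batched multiply performs over a thousand FLOPs per byte while the batch-1 case performs just under one, which is why adding TFLOPS does little for single-stream decoding and why the token-rate ceiling depends on bandwidth and model size alone.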
Diagnosing the regime is the first step in any AI-performance work. See arithmetic intensity and the memory wall to understand why memory so often wins.
