Definition
GEMM, short for General Matrix Multiply, is the core dense linear-algebra operation defined as C = αAB + βC, where A is an m×k matrix, B is k×n, C is m×n, and α and β are scalars. It is the workhorse Level 3 routine of the BLAS (Basic Linear Algebra Subprograms) standard, and it accounts for the large majority of arithmetic in modern AI workloads. Fully-connected layers and attention projections are GEMMs directly, and convolutions are commonly lowered to GEMM as well, for example after an im2col transform.
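As a minimal sketch of the definition, the triple loop below computes αAB + βC for matrices stored as lists of rows. The function name `gemm` and its argument order are illustrative assumptions, not the BLAS interface; real BLAS routines also overwrite C in place, whereas this version returns a new matrix to keep the example simple.

```python
def gemm(alpha, A, B, beta, C):
    """Return alpha * (A @ B) + beta * C for matrices stored as lists of rows."""
    m, k, n = len(A), len(B), len(B[0])
    if any(len(row) != k for row in A):
        raise ValueError("inner dimensions of A and B must match")
    if len(C) != m or any(len(row) != n for row in C):
        raise ValueError("C must have shape m x n")

    out = [[0.0] * n for _ in range(m)]
    for i in range(m):          # rows of A and C
        for j in range(n):      # columns of B and C
            acc = 0.0
            for p in range(k):  # shared inner dimension
                acc += A[i][p] * B[p][j]
            out[i][j] = alpha * acc + beta * C[i][j]
    return out


A = [[1.0, 2.0], [3.0, 4.0]]
B = [[5.0, 6.0], [7.0, 8.0]]
C = [[1.0, 1.0], [1.0, 1.0]]
result = gemm(2.0, A, B, 0.5, C)
print(result)  # [[38.5, 44.5], [86.5, 100.5]]
```

Setting α = 1 and β = 0 reduces this to a plain matrix product; a nonzero β lets the caller accumulate into an existing C.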
Why it dominates AI compute
A single transformer layer performs several large matrix multiplications on every forward pass, and training or serving a model repeats them across every layer and every step. Because GEMM is so central, hardware vendors design accelerators specifically to run it fast: GPU tensor cores and the systolic arrays in accelerators such as Google's TPUs are purpose-built matrix-multiply engines. When you see a chip rated at hundreds of TFLOPS, that headline number is typically its peak dense matrix-multiply throughput at a given precision.
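To put a headline TFLOPS figure in context, the sketch below counts the floating-point operations in one GEMM using the usual convention of 2 FLOPs per multiply-add (2·m·n·k in total, ignoring the much smaller cost of the α and β scaling) and divides by a peak rate. The layer shape and the 100 TFLOPS peak are hypothetical numbers chosen for illustration, not figures for any particular chip.

```python
def gemm_flops(m, n, k):
    """FLOPs in an (m x k) by (k x n) product, counting each multiply-add as 2."""
    return 2 * m * n * k


def ideal_seconds(flops, peak_tflops):
    """Lower bound on run time if the hardware sustained its peak rate."""
    return flops / (peak_tflops * 1e12)


# Hypothetical layer: 4096 tokens through a 4096 x 4096 weight matrix.
flops = gemm_flops(4096, 4096, 4096)
seconds = ideal_seconds(flops, peak_tflops=100.0)
print(flops)    # 137438953472
print(seconds)  # roughly 1.37 milliseconds at the assumed peak
```

The result is a best case: real kernels reach only a fraction of peak, and that fraction depends on matrix shape and precision.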
Performance and tiling
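An n×n GEMM performs about 2n³ floating-point operations on only 3n² matrix elements, so each element can in principle be reused on the order of n times. A naive implementation wastes that opportunity: it reloads the same values from main memory over and over and ends up bottlenecked by memory traffic. High-performance libraries therefore split the work into cache-sized tiles and reuse each loaded value many times. This data reuse is what lifts GEMM from a memory-bound to a compute-bound regime, letting it approach the hardware's peak floating-point rate. The amount of arithmetic done per byte fetched is the operation's arithmetic intensity, which determines where it sits on the roofline.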
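The loop structure behind tiling can be sketched as follows. The function `gemm_tiled` and its `tile` parameter are illustrative assumptions; in pure Python the blocking yields no real speedup, so the point is only to show how the three outer loops walk over tiles while the three inner loops stay within one tile, which in a compiled kernel would fit in cache or registers.

```python
def gemm_tiled(alpha, A, B, beta, C, tile=2):
    """Return alpha * (A @ B) + beta * C, visiting the work tile by tile."""
    m, k, n = len(A), len(B), len(B[0])
    # Start from beta * C, then accumulate each tile's partial products.
    out = [[beta * C[i][j] for j in range(n)] for i in range(m)]

    for i0 in range(0, m, tile):          # tile rows of C
        for j0 in range(0, n, tile):      # tile columns of C
            for p0 in range(0, k, tile):  # tiles along the inner dimension
                # Everything below touches one tile of A, B and C only.
                for i in range(i0, min(i0 + tile, m)):
                    for j in range(j0, min(j0 + tile, n)):
                        acc = 0.0
                        for p in range(p0, min(p0 + tile, k)):
                            acc += A[i][p] * B[p][j]
                        out[i][j] += alpha * acc
    return out


A = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
B = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0], [3.0, 0.0, 1.0]]
C = [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]
tiled = gemm_tiled(1.0, A, B, 1.0, C, tile=2)
print(tiled)  # [[11.0, 3.0, 6.0], [23.0, 6.0, 15.0], [35.0, 9.0, 24.0]]
```

The `min(...)` bounds handle matrices whose dimensions are not multiples of the tile size, as in the 3×3 example with `tile=2`.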
Understanding GEMM is the foundation for reasoning about real AI hardware efficiency. See the related concepts of FLOPS and arithmetic intensity to connect the math to measurable throughput.
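As a rough illustration of how arithmetic intensity connects to throughput, the sketch below estimates the intensity of a GEMM and applies the simple roofline bound min(peak, bandwidth × intensity). It assumes ideal memory traffic (A and B each read once, C written once, with β = 0) and 2-byte elements; the 100 TFLOPS peak and 1 TB/s bandwidth are hypothetical values, and the function names are illustrative.

```python
def gemm_arithmetic_intensity(m, n, k, bytes_per_element=2):
    """FLOPs per byte, assuming A and B are read once and C is written once."""
    flops = 2 * m * n * k
    bytes_moved = (m * k + k * n + m * n) * bytes_per_element
    return flops / bytes_moved


def attainable_tflops(intensity, peak_tflops, bandwidth_tb_per_s):
    """Roofline bound: limited by compute or by memory bandwidth."""
    return min(peak_tflops, bandwidth_tb_per_s * intensity)


# Large square GEMM: high intensity, so the compute roof applies.
big = gemm_arithmetic_intensity(4096, 4096, 4096)
# Single-row GEMM (a matrix-vector product): about 1 FLOP per byte.
skinny = gemm_arithmetic_intensity(1, 4096, 4096)

big_bound = attainable_tflops(big, peak_tflops=100.0, bandwidth_tb_per_s=1.0)
skinny_bound = attainable_tflops(skinny, peak_tflops=100.0, bandwidth_tb_per_s=1.0)
print(big, big_bound)
print(skinny, skinny_bound)
```

Under these assumptions the large square case is compute-bound, while the single-row case is capped by memory bandwidth at under 1% of peak, even though both are expressed as the same operation.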
