Model FLOPs Utilization (MFU)


Model FLOPs Utilization (MFU) measures how efficiently an AI workload uses its hardware. It is the ratio of the useful floating-point operations the model actually requires per second to the theoretical peak FLOPS of the hardware. Popularized by Google's PaLM team, MFU has become the standard apples-to-apples efficiency metric across different accelerators and software stacks: a single percentage that says how much of the silicon you paid for is doing real model math.

How It Is Computed

Take the FLOPs the model needs per token or sample for one forward (and, in training, backward) pass, multiply by the throughput in tokens or samples per second, and divide by the cluster's aggregate peak FLOPS. For dense transformers the numerator has a convenient approximation of roughly six FLOPs per parameter per training token (about two for the forward pass and four for the backward pass), which is why MFU can be estimated from nothing more than parameter count, token throughput, and the accelerator's spec sheet. Because the numerator counts only the model's intrinsic, implementation-independent arithmetic, MFU rewards genuine efficiency rather than wasted or redundant computation. A related metric, hardware FLOPs utilization (HFU), counts every operation the hardware actually performs, including activation recomputation; HFU is always at least as high as MFU, and the gap between them is the price of memory-saving tricks.
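The estimate described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names are hypothetical, the forward-only figure of 2 FLOPs per parameter per token is an assumption for inference workloads, and the peak of 1e15 FLOPS in the usage lines is a made-up round number rather than any real accelerator's spec.

```python
def model_flops_per_token(n_params: float, training: bool = True) -> float:
    """Approximate dense-transformer FLOPs per token.

    Training uses the ~6 FLOPs/parameter/token rule of thumb. The
    forward-only value of 2 is an assumption (one multiply and one add
    per weight). Attention-score FLOPs are ignored in both cases.
    """
    return (6.0 if training else 2.0) * n_params


def mfu(
    n_params: float,
    tokens_per_second: float,
    n_accelerators: int,
    peak_flops_per_accelerator: float,
    training: bool = True,
) -> float:
    """Model FLOPs Utilization as a fraction between 0 and 1."""
    achieved = model_flops_per_token(n_params, training) * tokens_per_second
    peak = n_accelerators * peak_flops_per_accelerator
    return achieved / peak


if __name__ == "__main__":
    # Hypothetical setup: 7B parameters, 10,000 tokens/s,
    # one accelerator with an assumed peak of 1e15 FLOPS.
    print(f"MFU: {mfu(7e9, 10_000, 1, 1e15):.1%}")
```

The peak figure should be the one quoted for the precision actually used in training, since the same chip has different peaks at different precisions.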

What Good Looks Like

Achieving 100% MFU is impossible in practice: memory bandwidth limits, inter-GPU communication, pipeline bubbles, kernel launch overhead, and attention's memory-bound phases all eat into it. Well-tuned large-model training typically lands in the 35–55% range — for reference, Llama 3.1 training reported roughly 38–43% MFU on H100 clusters. Inference is usually far worse, because small-batch, autoregressive decoding is dominated by moving weights through memory rather than by arithmetic; single-digit MFU during interactive chat is normal, not a bug. A low MFU signals that the hardware is stalling — often because the workload has slid into a memory-bound regime rather than staying compute-bound, exactly the distinction the roofline model formalizes.
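To see why small-batch decoding lands in single digits, a rough roofline-style ceiling can be computed as below. This is a simplified sketch under loud assumptions: each decode step reads every weight once, performs about 2 FLOPs per parameter per sequence in the batch, and ignores KV-cache and activation traffic. The function name and the hardware numbers in the usage lines (1e15 peak FLOPS, 3e12 bytes/s of memory bandwidth) are hypothetical.

```python
def decode_mfu_ceiling(
    batch_size: int,
    bytes_per_param: float,
    peak_flops: float,
    mem_bandwidth_bytes_per_s: float,
) -> float:
    """Upper bound on MFU for one autoregressive decode step.

    Arithmetic intensity is FLOPs performed per byte of weights moved;
    the parameter count cancels out of that ratio. The ridge point is
    the intensity at which the hardware stops being memory-bound.
    """
    intensity = 2.0 * batch_size / bytes_per_param   # FLOPs per byte
    ridge = peak_flops / mem_bandwidth_bytes_per_s   # FLOPs per byte
    return min(1.0, intensity / ridge)


if __name__ == "__main__":
    # Hypothetical accelerator with 2-byte (16-bit) weights.
    for batch in (1, 8, 64, 512):
        ceiling = decode_mfu_ceiling(batch, 2.0, 1e15, 3e12)
        print(f"batch {batch:>3}: MFU ceiling {ceiling:.1%}")
```

Under these assumptions the ceiling grows linearly with batch size until the workload crosses the ridge point and becomes compute-bound, which is why batching is the first lever for serving efficiency.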

Why It Matters on Hardware You Own

Miners will recognize the shape of this metric immediately: MFU is to an AI accelerator what realized efficiency is to an ASIC — the gap between the datasheet and what your rig actually delivers. For a sovereign builder fine-tuning or serving models on owned GPUs, MFU is the number that tells you whether to spend money or spend engineering. Doubling MFU through larger batch sizes, better kernels, quantization-aware serving, or sequence packing is equivalent to buying twice the hardware, at zero capital cost. It also keeps vendors honest: two setups with identical GPUs can differ by 2–3× in delivered training speed purely on software maturity, and MFU exposes that difference in one number.

Reading the Number Correctly

Track MFU over a run, not just at the start — dataloader stalls, checkpointing, and stragglers show up as sustained dips. Compare like with like: MFU at different precisions uses different peak-FLOPS denominators, so a "higher" number at FP8 is not automatically better engineering than a lower one at BF16. And remember what the metric cannot see: MFU measures how fast you compute, not whether the computation is worth doing. A perfectly utilized cluster training on bad data is still waste — just efficient waste. MFU turns the abstract roofline into a single number you can track and improve; the theory behind why the gap from 100% exists lives in the roofline model, and the units live in our FLOPS entry.
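Tracking MFU across a run could look something like the sketch below. It assumes you already log tokens processed and wall-clock seconds per step; the function names, the 80%-of-median threshold, and the three-step minimum for a "sustained" dip are illustrative choices, not established conventions.

```python
from statistics import median


def mfu_per_step(step_tokens, step_seconds, n_params, cluster_peak_flops):
    """Per-step training MFU using the ~6 FLOPs/parameter/token rule."""
    return [
        6.0 * n_params * tokens / seconds / cluster_peak_flops
        for tokens, seconds in zip(step_tokens, step_seconds)
    ]


def find_sustained_dips(mfu_values, tolerance=0.8, min_len=3):
    """Return (start, end) index ranges where MFU stays below
    tolerance * median for at least min_len consecutive steps."""
    threshold = tolerance * median(mfu_values)
    dips, start = [], None
    for i, value in enumerate(mfu_values):
        if value < threshold:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_len:
                dips.append((start, i - 1))
            start = None
    # Close a dip that runs through the final step.
    if start is not None and len(mfu_values) - start >= min_len:
        dips.append((start, len(mfu_values) - 1))
    return dips


if __name__ == "__main__":
    # Made-up series: one isolated slow step and one sustained dip.
    series = [0.4, 0.4, 0.1, 0.4, 0.1, 0.1, 0.1, 0.4, 0.4, 0.4]
    print(find_sustained_dips(series))
```

Requiring several consecutive slow steps filters out one-off hiccups, so what remains is more likely to be a checkpoint write, a stalled dataloader, or a straggling node.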

A worked example shows how little ceremony the estimate needs: a 7-billion-parameter model trained at 10,000 tokens per second requires roughly 6 × 7B × 10,000 ≈ 4.2 × 10¹⁴ FLOPs per second of useful work (about 420 TFLOPS); divide that by your GPUs' combined peak at the precision you train in, and the quotient is your MFU. If the number embarrasses you, the fix is almost always software (batch size, kernels, data pipeline) long before it is new hardware.
