Definition
A Tensor Core is a specialized execution unit inside modern NVIDIA GPUs, first introduced with the Volta architecture. Where a standard GPU core handles general arithmetic, a Tensor Core is purpose-built to perform one fused matrix multiply-accumulate operation per clock cycle — the single most common operation in deep-learning training and inference.
Mixed-precision math
Each Tensor Core computes the operation D = A x B + C on small matrices. The trick is mixed precision: the input matrices A and B are typically supplied in a lower-precision format such as FP16, the multiply produces a full-precision product, and the results are accumulated in higher-precision FP32. This preserves enough numerical accuracy for neural networks while dramatically increasing throughput. A single Volta V100 packed 640 Tensor Cores for roughly 125 mixed-precision teraflops; newer generations add formats like BF16 and FP8.
Why it matters
Tensor Cores are the reason a modern GPU can train and serve large models so much faster than older hardware — they are the workhorse behind the headline FLOPS figures vendors advertise. When evaluating a card for AI, the Tensor-Core count and supported precisions matter far more than the raw graphics specs.
For anyone sizing hardware for a local LLM, Tensor-Core throughput and the available memory bandwidth together determine how usable the rig will be.
In Simple Terms
A Tensor Core is a specialized execution unit inside modern NVIDIA GPUs, first introduced with the Volta architecture. Where a standard GPU core handles general…
