Definition
Layer normalization, introduced by Ba, Kiros, and Hinton in 2016, is a technique that stabilizes neural network training by re-scaling the activations inside each layer. For every token it computes the mean and standard deviation across that token's feature dimensions and normalizes the activations to zero mean and unit variance, then applies a learned scale and shift. Unlike batch normalization, it does not depend on other examples in the batch, so it behaves identically during training and inference, which is essential for the variable-length sequences a Transformer processes.
Where it sits in a Transformer
Each attention and feed-forward sub-block is paired with a layer-norm step and a residual connection. Placing normalization before the sub-block (pre-norm) rather than after it makes very deep stacks far easier to train, which is why nearly all modern LLMs are pre-norm. Without this normalization, activations drift in scale as signals pass through dozens of layers and gradient descent becomes unstable.
RMSNorm, the common variant
Many current open-weight models replace classic layer norm with RMSNorm, which divides only by the root-mean-square of the activations and drops the mean-subtraction and bias terms. It is cheaper to compute and works just as well in practice, a small efficiency win that adds up when you run inference on your own hardware.
Related: residual connection and backpropagation.
In Simple Terms
Layer normalization, introduced by Ba, Kiros, and Hinton in 2016, is a technique that stabilizes neural network training by re-scaling the activations inside each layer.…
