Definition
A residual connection, also called a skip connection, is a shortcut that adds a layer's input directly to its output before passing the result on: the block computes y = x + F(x), where F is the layer's own transformation. Popularized by the 2015 ResNet paper by He and colleagues, it made networks over a hundred layers deep practical to train. Instead of forcing each block to learn a full transformation H(x), the block only has to learn the residual F(x) = H(x) - x, the difference from the identity. When the best mapping is close to the identity, driving F toward zero is a far easier target than reproducing the input through a stack of nonlinear layers.
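The addition itself is small enough to show in a few lines. The following is a minimal sketch in plain Python, not any library's implementation; the names residual_block, zero_layer, and small_correction are hypothetical and chosen only for illustration.

```python
from typing import Callable, List

Vector = List[float]


def residual_block(x: Vector, f: Callable[[Vector], Vector]) -> Vector:
    """Return x + f(x), element by element."""
    fx = f(x)
    if len(fx) != len(x):
        raise ValueError("f must preserve the vector length so the sum is defined")
    return [xi + fi for xi, fi in zip(x, fx)]


def zero_layer(x: Vector) -> Vector:
    """Hypothetical layer whose residual is all zeros."""
    return [0.0 for _ in x]


def small_correction(x: Vector) -> Vector:
    """Hypothetical learned residual: a nudge of 10% of each input."""
    return [0.1 * xi for xi in x]


x = [1.0, -2.0, 0.5]
identity_out = residual_block(x, zero_layer)      # the block acts as the identity
nudged_out = residual_block(x, small_correction)  # input plus a small correction
```

With a zero residual the block passes its input through unchanged, which is why a block that has learned nothing does no harm to the signal.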
Why deep models need it
During backpropagation, the error gradient is multiplied by each layer's local derivative on its way back, so across many layers it tends to shrink toward zero. This is the vanishing-gradient problem. A residual connection gives that signal a clean additive path: because the block computes y = x + F(x), the gradient with respect to x is the upstream gradient itself plus the part that flows through F, so some signal reaches earlier layers even when F's contribution is small. This is a large part of what lets a Transformer stack dozens of attention and feed-forward blocks and still train successfully.
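A toy calculation makes the effect concrete. The sketch below treats each layer as a scalar function with an assumed local derivative of 0.05 (an arbitrary illustrative value, not a measurement from any real network) and compares the chain-rule product with and without the skip path. The function names are hypothetical.

```python
def plain_chain_gradient(local_derivs):
    """Gradient through stacked layers y = f(x): product of local derivatives."""
    grad = 1.0
    for d in local_derivs:
        grad *= d
    return grad


def residual_chain_gradient(local_derivs):
    """Gradient through stacked blocks y = x + f(x): each factor is 1 + f'(x)."""
    grad = 1.0
    for d in local_derivs:
        grad *= 1.0 + d
    return grad


depth = 20
local_derivs = [0.05] * depth  # assumed: every layer has a small local derivative

plain_grad = plain_chain_gradient(local_derivs)        # 0.05 ** 20, vanishingly small
residual_grad = residual_chain_gradient(local_derivs)  # 1.05 ** 20, stays of order one
```

The "+ 1" in every factor is the identity path. It keeps the product from collapsing, although in a real network the factors are matrices and normalization layers also help keep their size in check.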
In the Transformer block
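Every sub-block in a Transformer, both the self-attention and the feed-forward network, is wrapped in a residual connection paired with layer normalization. The pattern is simple but indispensable, and it comes in two orderings. The original Transformer normalizes after the addition (post-LN): output = LayerNorm(x + Sublayer(x)). Many later models normalize the sub-block's input instead (pre-LN): output = x + Sublayer(LayerNorm(x)). Without the residual paths, a deep Transformer typically trains poorly or fails to converge.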
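The wrapping can be sketched for the two orderings in common use: post-LN, which normalizes the sum, and pre-LN, which normalizes the sub-block's input. This is a simplified illustration in plain Python under stated assumptions: layer_norm omits the learned gain and bias, and toy_sublayer is a hypothetical stand-in for attention or the feed-forward network.

```python
import math
from typing import Callable, List

Vector = List[float]


def layer_norm(x: Vector, eps: float = 1e-5) -> Vector:
    """Normalize to zero mean and unit variance (no learned gain or bias)."""
    mean = sum(x) / len(x)
    var = sum((xi - mean) ** 2 for xi in x) / len(x)
    return [(xi - mean) / math.sqrt(var + eps) for xi in x]


def post_ln(x: Vector, sublayer: Callable[[Vector], Vector]) -> Vector:
    """Post-LN ordering: LayerNorm(x + Sublayer(x))."""
    summed = [xi + si for xi, si in zip(x, sublayer(x))]
    return layer_norm(summed)


def pre_ln(x: Vector, sublayer: Callable[[Vector], Vector]) -> Vector:
    """Pre-LN ordering: x + Sublayer(LayerNorm(x))."""
    return [xi + si for xi, si in zip(x, sublayer(layer_norm(x)))]


def toy_sublayer(x: Vector) -> Vector:
    """Hypothetical stand-in for attention or a feed-forward network."""
    return [0.5 * xi for xi in x]


x = [1.0, 2.0, 3.0, 4.0]
post_out = post_ln(x, toy_sublayer)  # the sum is normalized, skip path included
pre_out = pre_ln(x, toy_sublayer)    # the skip path carries x through untouched
```

In the pre-LN form the input travels along the skip path without being normalized, so the identity route through the whole stack stays intact. In the post-LN form every block renormalizes the sum.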
For the bigger picture of how these pieces assemble into a working model, see Transformer.
