Definition
An activation function is the nonlinear step applied to a neuron's weighted sum before it passes to the next layer. Without it, stacking layers would collapse into a single linear transformation no matter how deep the network, so activation functions are what give neural networks their expressive power. In transformers, the activation lives in the feed-forward block that follows attention, and the choice of function measurably affects model quality.
From ReLU to GELU and SwiGLU
Early networks used ReLU, which simply zeroes negative inputs. Transformers moved to GELU (Gaussian Error Linear Unit), introduced in 2016, which weights each input by the probability mass of a Gaussian below it, producing a smoother curve that improved language and vision results. The current frontier favors gated variants like SwiGLU, which split the feed-forward input into a value path and a gate path and multiply them; reported results show SwiGLU lowering loss relative to GELU. Meta's Llama recipe of pre-norm plus RMSNorm plus SwiGLU plus RoPE became the de facto template for open models.
Why the detail matters
Gated activations like SwiGLU use three weight matrices in the feed-forward block instead of two, which changes the parameter budget and the shapes you will see when inspecting or quantizing a checkpoint. Knowing which activation a model uses helps when converting weights for local inference.
The feed-forward block that houses the activation is also the part replaced in sparse models; see Mixture of Experts (MoE) and the normalization that precedes it in Layer Normalization.
In Simple Terms
An activation function is the nonlinear step applied to a neuron’s weighted sum before it passes to the next layer. Without it, stacking layers would…
