Activation Function


An activation function is the nonlinear step applied to a neuron's weighted sum before the result passes to the next layer. Without it, a stack of layers collapses into a single linear transformation no matter how deep the network is, because the composition of linear maps is itself linear. The activation function is therefore what gives a neural network its expressive power. In transformer language models, the activation lives inside the feed-forward block that follows attention in every layer, and the choice of function has a measurable effect on model quality, training stability, and the shape of the weights you will find when you open a checkpoint.
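The collapse can be checked by hand on a toy case. The following is a minimal sketch in plain Python; the matrices `W1` and `W2`, the input `x`, and the helper names are made up for illustration and do not come from any real model.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*b)] for row in a]


def relu_matrix(m):
    """Apply ReLU elementwise: negative entries become zero."""
    return [[max(0.0, v) for v in row] for row in m]


# Hypothetical 2x2 layer weights and a column-vector input.
W1 = [[1.0, -2.0], [0.5, 1.0]]
W2 = [[2.0, 1.0], [-1.0, 3.0]]
x = [[1.0], [1.0]]

# Two linear layers applied one after the other...
stacked = matmul(W2, matmul(W1, x))
# ...give the same result as one layer whose weights are W2 @ W1.
collapsed = matmul(matmul(W2, W1), x)
# Inserting a nonlinearity between the layers breaks that equivalence.
with_relu = matmul(W2, relu_matrix(matmul(W1, x)))

print(stacked, collapsed, with_relu)
```

Here the first two results agree while the third differs, which is the whole point: only the version with a nonlinearity between the layers computes something that a single matrix could not.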

The concept long predates language models: the earliest artificial neurons used hard threshold steps, and the sigmoid and tanh functions that succeeded them dominated for decades before their vanishing-gradient problems pushed the field toward the modern rectifier family. Every generation of activation function has been a negotiation between mathematical convenience, gradient behavior during training, and raw computational cost, and the winners have usually been the functions that trained the deepest networks most reliably rather than the most theoretically elegant ones.

From ReLU to GELU to gated units

Early deep networks standardized on ReLU, the rectified linear unit, which simply zeroes negative inputs and passes positive ones through. It is cheap and avoids the vanishing gradients that plagued older sigmoid-style functions, but its hard cutoff can permanently deactivate neurons: a unit whose input stays negative receives zero gradient and stops learning. Transformers largely moved to GELU, the Gaussian Error Linear Unit introduced in 2016, which scales each input by the standard Gaussian cumulative probability at that value. The result is a smooth curve that behaves like ReLU for large values while letting small negative signals through in proportion, and it improved results across language and vision benchmarks. The current frontier favors gated variants such as SwiGLU, which project the feed-forward input along two parallel paths, a value path and a gate path, and multiply the two elementwise, letting the network learn what to pass and what to suppress. Reported results show gated units lowering loss relative to plain GELU at equal compute, and the recipe of pre-norm, RMSNorm, SwiGLU, and rotary embeddings popularized by Meta's Llama models became the de facto template for open-weight models. The normalization that precedes the block is covered under layer normalization.
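The three functions are short enough to write out. This is a sketch in plain Python using the standard definitions (exact GELU via the error function, SiLU as the gate nonlinearity in SwiGLU); the weight names `w_gate`, `w_up`, and `w_down` are illustrative, biases are assumed absent, and real implementations operate on tensors rather than Python lists.

```python
import math


def relu(x):
    """Zero for negative inputs, identity for positive ones."""
    return max(0.0, x)


def gelu(x):
    """x times the standard Gaussian CDF at x (exact form, not the tanh approximation)."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def silu(x):
    """x times sigmoid(x); also called Swish. Used as the gate nonlinearity in SwiGLU."""
    return x / (1.0 + math.exp(-x))


def linear(weights, vec):
    """Matrix-vector product; weights is a list of rows, one row per output feature."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]


def swiglu_ffn(x, w_gate, w_up, w_down):
    """Gated feed-forward block: down(silu(gate(x)) * up(x))."""
    gate = [silu(g) for g in linear(w_gate, x)]   # gate path
    value = linear(w_up, x)                       # value path
    hidden = [g * v for g, v in zip(gate, value)] # elementwise product
    return linear(w_down, hidden)
```

Evaluating `gelu(-0.5)` gives a small negative number where `relu(-0.5)` gives zero, which is the "small negative signals" behavior described above. In `swiglu_ffn`, a gate value near zero suppresses the corresponding value-path entry regardless of its magnitude.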

Why the detail matters to a self-hoster

This is not trivia if you run models on your own hardware. Gated activations use three weight matrices in the feed-forward block instead of two, which changes the parameter budget and the tensor shapes you will see when inspecting, converting, or quantizing a checkpoint for local inference. A conversion script that assumes the wrong activation produces a model that loads and then generates garbage, and the mismatch is easier to diagnose when you know what the architecture should contain. Activation choice also interacts with quantization behavior, since different functions produce different activation distributions, and it determines a meaningful slice of inference compute, because the feed-forward block accounts for a large share of a transformer's parameters and FLOPs.
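The parameter difference is simple arithmetic. The sketch below assumes bias-free projections and uses illustrative dimensions; the tensor names and the `(out_features, in_features)` shape convention are assumptions for the example, and actual checkpoints name and lay out their tensors in format-specific ways.

```python
def ffn_shapes(d_model, d_hidden, gated):
    """Return the weight shapes of a feed-forward block as (out_features, in_features).

    Tensor names are hypothetical; biases are ignored.
    """
    if gated:
        return {
            "w_gate": (d_hidden, d_model),
            "w_up": (d_hidden, d_model),
            "w_down": (d_model, d_hidden),
        }
    return {
        "w_in": (d_hidden, d_model),
        "w_out": (d_model, d_hidden),
    }


def ffn_params(d_model, d_hidden, gated):
    """Total weight count: sum of rows * cols over all matrices in the block."""
    return sum(rows * cols for rows, cols in ffn_shapes(d_model, d_hidden, gated).values())


# Illustrative sizes: a plain block with hidden width 4 * d_model, and a gated
# block with a narrower hidden width so the two budgets land close together.
plain = ffn_params(d_model=4096, d_hidden=16384, gated=False)
gated = ffn_params(d_model=4096, d_hidden=11008, gated=True)
print(plain, gated)
```

With these numbers the plain block has 134,217,728 weights per layer and the gated block 135,266,304. Gated designs commonly shrink the hidden width to roughly two thirds of the plain width for this reason: the third matrix is paid for by making all three narrower. A checkpoint showing three feed-forward tensors per layer, two of them with identical shapes, is a strong hint that the architecture is gated.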

Where the activation sits in the bigger picture

The feed-forward block that houses the activation is also the component that sparse architectures replace: a Mixture of Experts model swaps the single feed-forward network for many expert networks and routes each token to a few of them, activation function included. Downstream of every layer, the model's final scores are turned into probabilities by softmax, a different kind of nonlinearity with a different job. And when you adapt a model to your own data through fine-tuning, the activation is part of what determines how gracefully the network absorbs new patterns. For a sovereignty-minded operator, the takeaway is simple: the activation function is a small design decision with large consequences, and knowing which one your model uses is part of actually owning your stack rather than merely renting the output of someone else's.
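Softmax and expert routing can both be illustrated in a few lines. This is a sketch under stated assumptions: the router scores are invented, and renormalizing the weights over the selected experts is one common choice rather than a universal rule, since routing details vary between Mixture of Experts models.

```python
import math


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1.

    Subtracting the maximum first keeps exp() from overflowing.
    """
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_scores, k):
    """Pick the k highest-probability experts and renormalize their weights.

    Returns a list of (expert_index, weight) pairs, best expert first.
    """
    probs = softmax(router_scores)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    kept = sum(probs[i] for i in top)
    return [(i, probs[i] / kept) for i in top]


# Hypothetical router scores for one token across four experts.
choices = route_top_k([0.1, 2.0, -1.0, 1.5], k=2)
print(choices)
```

In this example the token is sent to experts 1 and 3, and each expert's feed-forward output (activation function included) would be scaled by its weight before the results are summed.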
