Definition
Gradient descent is the optimization method that actually trains a neural network. Backpropagation computes the gradient of the loss with respect to every weight; gradient descent then steps each weight a small amount in the opposite direction of its gradient, walking the model downhill on the loss surface toward lower error. The size of each step is the learning rate, the single most influential hyperparameter in training. Too large and training diverges; too small and it crawls.
Stochastic and mini-batch variants
Computing the gradient over an entire dataset every step is impractical at scale, so models use stochastic gradient descent, estimating the gradient from a small random mini-batch of examples. This adds noise but is vastly faster and often generalizes better. Modern training almost always uses an adaptive optimizer such as Adam or AdamW, which keep running estimates of each weight's gradient and its variance to set a per-weight step size automatically.
Why it matters for model provenance
Every open-weight model you can self-host is the frozen end state of millions of gradient-descent steps over a training corpus. Understanding this clarifies what fine-tuning actually does: it resumes gradient descent from existing weights on new data, which is how you can adapt a base Transformer to your own domain on hardware you control.
Related entries: backpropagation and layer normalization.
In Simple Terms
Gradient descent is the optimization method that actually trains a neural network. Backpropagation computes the gradient of the loss with respect to every weight; gradient…
