Gradient Descent

Gradient descent is the optimization method that actually trains a neural network. Backpropagation computes the gradient of the loss with respect to every weight, which measures how the loss changes as each weight changes and therefore points in the direction of steepest increase. Gradient descent then steps each weight a small amount in the opposite direction, walking the model downhill on the loss surface: w ← w − η · ∇L(w), where η is the learning rate. Repeat millions of times over enough data, and a randomly initialized network becomes a language model. The learning rate is arguably the single most influential hyperparameter in training: too large and the optimization overshoots and diverges; too small and it crawls, burning compute for negligible progress.
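As a minimal sketch of the update described above, the snippet below runs plain gradient descent on a one-weight toy loss and compares three learning rates. The loss L(w) = (w - 3)^2, the function names, and the specific learning rates are illustrative assumptions, not anything taken from a real training setup.

```python
def gradient_descent(grad, w0, learning_rate, steps):
    """Repeatedly step against the gradient and return the final weight."""
    w = w0
    for _ in range(steps):
        w = w - learning_rate * grad(w)
    return w


def toy_grad(w):
    """Gradient of the toy loss L(w) = (w - 3)^2, which is minimized at w = 3."""
    return 2.0 * (w - 3.0)


# Same start and step count; only the learning rate differs (all values hypothetical).
well_tuned = gradient_descent(toy_grad, w0=0.0, learning_rate=0.1, steps=50)
too_large = gradient_descent(toy_grad, w0=0.0, learning_rate=1.1, steps=50)
too_small = gradient_descent(toy_grad, w0=0.0, learning_rate=0.0001, steps=50)

print("well tuned:", well_tuned)  # lands very close to 3
print("too large: ", too_large)   # overshoots further on every step and diverges
print("too small: ", too_small)   # has barely left the starting point
```

On this toy loss each step multiplies the distance to the minimum by (1 - 2 × learning rate), which is why a rate above 1.0 makes the error grow instead of shrink.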

Stochastic and mini-batch variants

Computing the exact gradient over an entire dataset for every step is impractical at scale, so training uses stochastic gradient descent (SGD), which estimates the gradient from a small random mini-batch of examples. This makes each step noisy but vastly cheaper, and the noise itself often helps, nudging the model out of poor local basins and toward solutions that generalize better. Modern practice layers refinements on top: momentum accumulates a running direction across steps so the optimizer coasts through noisy terrain, and adaptive optimizers such as Adam and AdamW keep per-weight running averages of the gradient and of its square, then use them to set an effective per-weight step size automatically. Learning-rate schedules add the finishing touch: a gentle warmup, then gradual decay as the model settles.
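The pieces in the paragraph above can be sketched together on a toy problem: fitting a line y = w·x + b with mini-batch SGD, momentum, and a warmup-then-decay schedule. Everything here (the synthetic data, the function names `lr_schedule` and `train`, the linear shape of the schedule, and every hyperparameter value) is an assumption chosen for illustration; Adam is left out to keep the sketch short.

```python
import random


def lr_schedule(step, total_steps, peak_lr, warmup_steps):
    """Linear warmup to peak_lr, then linear decay toward zero."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    return peak_lr * (total_steps - step) / (total_steps - warmup_steps)


def train(data, total_steps=500, batch_size=8, peak_lr=0.05,
          warmup_steps=25, momentum=0.9, seed=0):
    """Fit y = w * x + b with mini-batch SGD plus momentum."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    vel_w, vel_b = 0.0, 0.0
    for step in range(total_steps):
        # Estimate the gradient from a small random mini-batch, not the full dataset.
        batch = rng.sample(data, batch_size)
        grad_w = sum(2 * (w * x + b - y) * x for x, y in batch) / batch_size
        grad_b = sum(2 * (w * x + b - y) for x, y in batch) / batch_size
        # Momentum: keep a decaying sum of past gradients and step along it.
        vel_w = momentum * vel_w + grad_w
        vel_b = momentum * vel_b + grad_b
        lr = lr_schedule(step, total_steps, peak_lr, warmup_steps)
        w -= lr * vel_w
        b -= lr * vel_b
    return w, b


# Synthetic data (hypothetical): y = 2x + 1 plus a little Gaussian noise.
data_rng = random.Random(42)
data = []
for _ in range(200):
    x = data_rng.uniform(-1.0, 1.0)
    data.append((x, 2.0 * x + 1.0 + data_rng.gauss(0.0, 0.1)))

w, b = train(data)
print(f"learned w={w:.3f}, b={b:.3f}")  # should end up near the true 2 and 1
```

Each step only ever sees 8 of the 200 examples, yet the fit still settles near the true parameters; the decaying schedule shrinks the steps late in training so the mini-batch noise averages out.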

The loss landscape intuition

A useful mental picture: the loss surface of a large model is a landscape in billions of dimensions, and gradient descent is a hiker who can only feel the slope underfoot. Remarkably, this local, greedy procedure works — in very high dimensions, the landscape of overparameterized networks turns out to be navigable, with many good minima rather than one needle in a haystack. There is no global map and no guarantee of the best solution, only an empirically excellent one. Every capability in a trained model was carved by this blind downhill walk, which is worth remembering when reasoning about what models can and cannot reliably do.

Why it matters for model provenance

Every open-weight model you can self-host is the frozen end state of an enormous gradient-descent run over a training corpus. Understanding this clarifies what fine-tuning actually does: it resumes gradient descent from existing weights on new data, usually with a small learning rate so the new signal adjusts rather than overwrites what the base model learned. Parameter-efficient methods like LoRA freeze the base weights and train only small low-rank update matrices. That constrains which directions the descent may move in and shrinks the gradients and optimizer state that must fit in memory, which is why they run on consumer GPUs. For the sovereignty-minded, the takeaway is practical: pretraining from scratch is out of reach for individuals, but resuming the descent (adapting a base Transformer to your own domain, on hardware you control, with data that never leaves your machine) is entirely achievable.
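A rough back-of-the-envelope sketch shows why a low-rank method such as LoRA is so much lighter than full fine-tuning. It assumes the usual LoRA formulation, in which the update to a d_out × d_in weight matrix is the product of a d_out × r matrix and an r × d_in matrix; the 4096 × 4096 layer size and rank 8 are hypothetical values chosen only for the arithmetic.

```python
def full_finetune_params(d_out, d_in):
    """Trainable weights when the whole d_out x d_in matrix is updated."""
    return d_out * d_in


def lora_params(d_out, d_in, rank):
    """Trainable weights for a low-rank update B @ A,
    where B is d_out x rank and A is rank x d_in."""
    return rank * (d_out + d_in)


# Hypothetical layer size and rank, for illustration only.
d_out, d_in, rank = 4096, 4096, 8
full = full_finetune_params(d_out, d_in)
low_rank = lora_params(d_out, d_in, rank)

print(full)              # 16777216
print(low_rank)          # 65536
print(full // low_rank)  # 256
```

Gradients and optimizer state are only needed for the trainable weights, so a reduction like this in trainable parameters translates fairly directly into memory saved during training, although the frozen base weights still have to be held in memory for the forward pass.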

The knobs you'll actually touch

Fine-tuning on your own hardware puts a handful of gradient-descent controls in your hands. Learning rate dominates: too high shows up as loss spiking or the model forgetting its base abilities; too low as a loss curve that barely moves. Batch size trades gradient quality against memory, with gradient accumulation as the standard trick for simulating large batches on a small GPU. Epochs control repetition over your dataset — small datasets overfit within a few passes, visible as training loss falling while validation loss climbs. A short warmup and a decaying schedule are near-universal defaults worth keeping. The habit that matters most is watching the curves rather than the clock: a run whose validation loss has flatlined is finished regardless of how many epochs remain in the plan.
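Two of the habits above lend themselves to a short sketch: gradient accumulation, and stopping when validation loss has flatlined. The toy model y = w·x, the helper names, and the `patience` and `min_delta` values are assumptions for illustration; real training frameworks expose their own versions of both ideas.

```python
def mse_grad(w, batch):
    """Gradient of mean squared error for the toy model y = w * x."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


def accumulated_grad(w, batch, micro_batch_size):
    """Average gradients over equal-sized micro-batches.

    Only one micro-batch needs to be in memory at a time, yet the result
    matches the gradient of the full batch (up to floating-point rounding).
    """
    micro_batches = [batch[i:i + micro_batch_size]
                     for i in range(0, len(batch), micro_batch_size)]
    total = 0.0
    for micro in micro_batches:
        total += mse_grad(w, micro)
    return total / len(micro_batches)


def should_stop(val_losses, patience=3, min_delta=1e-3):
    """True once the last `patience` evaluations failed to beat the
    earlier best validation loss by at least `min_delta`."""
    if len(val_losses) <= patience:
        return False
    best_before = min(val_losses[:-patience])
    recent_best = min(val_losses[-patience:])
    return recent_best > best_before - min_delta


# Hypothetical batch of 8 examples, processed as 4 micro-batches of 2.
batch = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8),
         (5.0, 10.1), (6.0, 12.2), (7.0, 13.9), (8.0, 16.0)]
big_batch_grad = mse_grad(0.5, batch)
small_gpu_grad = accumulated_grad(0.5, batch, micro_batch_size=2)
print(big_batch_grad, small_gpu_grad)  # equal up to rounding

print(should_stop([1.0, 0.8, 0.7, 0.7, 0.71, 0.72]))  # True: loss has flatlined
print(should_stop([1.0, 0.8, 0.6, 0.5, 0.4, 0.3]))    # False: still improving
```

The equivalence only holds exactly when the micro-batches are the same size and the loss is a per-example mean; with uneven micro-batches each partial gradient would need to be weighted by its example count.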

Related entries: backpropagation, which supplies the gradients; layer normalization, which keeps deep networks trainable at all; and fine-tuning, where you get to hold the controls yourself.
