Knowledge Distillation


Knowledge distillation is a model-compression technique that transfers the behaviour of a large, capable teacher model into a smaller, cheaper student model. Instead of training the student only on hard labels (the single correct answer per example), it is trained to reproduce the teacher's full output distribution, the so-called soft labels. Where a hard label says only "the answer is A," the teacher's distribution says "mostly A, but B is plausible and C is not absurd," and that shape carries information about how the teacher relates classes or tokens to one another. Students trained on it often reach accuracy well above what the same small model achieves when trained on hard labels alone, which is why distillation has become a standard stage in producing the small open-weight models the self-hosting world runs on.

How the training works

The student minimizes a loss that measures the gap between its predictions and the teacher's soft targets, most commonly KL divergence, often blended with a conventional loss on the ground-truth labels. A temperature parameter softens the teacher's distribution during training so the student can learn from the relative probabilities of unlikely options, not just the winner. The teacher stays frozen throughout; only the student learns. Variants extend the idea beyond final outputs: feature-based distillation aligns intermediate representations between teacher and student layers, and relation-based distillation transfers the relationships the teacher perceives between examples. For language models specifically, a related and widely used practice is training the student on text generated by the teacher — distilling behaviour through synthetic data rather than logits, useful when the teacher is only reachable through an API.
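The loss described above can be sketched in plain Python for a single example. This is a minimal illustration, not a production training loop: the function names, the `temperature` and `alpha` defaults, and the toy logits are all assumptions chosen for readability. Scaling the soft term by the square of the temperature is a common convention for keeping the two terms on a comparable scale, not a requirement.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; a higher temperature flattens the distribution."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same classes."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def distillation_loss(student_logits, teacher_logits, true_index,
                      temperature=2.0, alpha=0.5):
    """Blend of a soft-target KL term and a hard-label cross-entropy term.

    temperature and alpha are illustrative defaults (assumptions), not
    recommended values.
    """
    # Soft-target term: teacher and student are softened with the same temperature.
    teacher_soft = softmax(teacher_logits, temperature)
    student_soft = softmax(student_logits, temperature)
    soft_term = kl_divergence(teacher_soft, student_soft) * temperature ** 2
    # Hard-label term: ordinary cross-entropy on the ground truth at temperature 1.
    hard_term = -math.log(softmax(student_logits)[true_index])
    return alpha * soft_term + (1 - alpha) * hard_term

# Toy logits over three classes (made-up numbers).
teacher = [4.0, 2.0, 0.5]
student = [2.5, 1.0, 1.0]
print(softmax(teacher, temperature=1.0))  # sharp: most mass on the winner
print(softmax(teacher, temperature=4.0))  # softened: runners-up become visible
print(distillation_loss(student, teacher, true_index=0))
```

In a real setup the teacher logits would come from a frozen model and the loss would be averaged over a batch and differentiated with respect to the student's parameters only.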

Why it matters for self-hosters

For a sovereign Bitcoiner running models on local hardware, distillation is a large part of what makes a small model worth running at all. A well-distilled model in the single-digit-billions of parameters captures a surprising share of a frontier model's competence while fitting in the VRAM of a single consumer GPU — no API, no cloud account, no telemetry leaving your network. Many of the popular small open-weight models available through Ollama or as GGUF files were produced or refined with distillation somewhere in their pipeline; smaller size then compounds with quantization to shrink memory further and lift inference speed. Distillation sets the capability ceiling of the small model; quantization decides how cheaply you can serve it.
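To make the interaction between model size and quantization concrete, here is a rough back-of-the-envelope sketch. It counts memory for the weights only; the 7-billion-parameter figure is a hypothetical example size, and the helper name `weight_memory_gb` is made up for illustration. Real usage is higher once the KV cache, activations, and runtime overhead are included, and real quantized formats spend slightly more than their nominal bits per weight on scaling metadata.

```python
def weight_memory_gb(num_parameters, bits_per_weight):
    """Approximate memory for the weights alone, in gigabytes (10**9 bytes)."""
    return num_parameters * bits_per_weight / 8 / 1e9

# Hypothetical 7-billion-parameter model at three nominal precisions.
for bits in (16, 8, 4):
    print(f"7B parameters at {bits}-bit: {weight_memory_gb(7e9, bits):.1f} GB")
```

Under these assumptions the weights take about 14 GB at 16-bit, 7 GB at 8-bit, and 3.5 GB at 4-bit, which is the difference between needing a workstation card and fitting on a mid-range consumer GPU.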

Limits worth knowing

A student is a compression of its teacher, not a clone. Capability loss is uneven — distilled models tend to hold up well on patterns richly represented in training and degrade on long-tail knowledge and deep multi-step reasoning, where sheer parameter count still matters. Students also inherit their teacher's biases, quirks, and blind spots, including alignment behaviour baked in upstream: whoever distilled the model chose what to preserve. As with any model you pull onto your own metal, provenance is part of the trust decision.

The bigger picture

Distillation belongs to the same shrink-to-fit toolbox as quantization and pruning, but it works differently. Quantization and pruning shrink an existing model and at best preserve its quality; distillation trains a small model on a richer signal than raw data alone provides, so the student can end up better than the same model trained without a teacher. That is what makes it strategically important for AI sovereignty: it is the mechanism by which frontier-scale capability keeps trickling down to hardware individuals actually own.
