Model Distillation

Sovereign AI

Model distillation, also called knowledge distillation, is a compression technique in which a small student model is trained to imitate the behavior of a larger, more capable teacher model. The goal is to retain most of the teacher's performance while shrinking the parameter count, memory footprint, and inference cost, turning a model that needs a rack of accelerators into one that runs acceptably on a single GPU or a capable workstation. It is one of the main reasons genuinely useful language models now run on hardware an individual can own.

The technique was formalized in a 2015 paper by Geoffrey Hinton, Oriol Vinyals, and Jeff Dean, "Distilling the Knowledge in a Neural Network," which introduced the teacher-student framing and the use of a softened output distribution as the training signal. The underlying idea of compressing a large model's behavior into a smaller one is older: Bucilă, Caruana, and Niculescu-Mizil described it as model compression in 2006. What began as a compression trick for deploying models on phones has become a central mechanism of the open-model economy, where each new frontier release is followed by waves of smaller models trained in its image.

How it works

The classic formulation trains the student not on hard, one-hot labels but on the teacher's full output distribution, the soft probabilities it assigns across all possible answers, usually flattened further by a temperature parameter so that small probabilities stay visible. Those soft targets carry far more information per example than a bare label: they encode which wrong answers the teacher considers nearly right, which distinctions it treats as close calls, and how confident it is, a structure sometimes called dark knowledge. A student learning that a given input yields 70 percent one answer and 25 percent a near-miss learns the shape of the teacher's judgment, not just its verdicts, and so typically trains more efficiently and often generalizes better than it would from labels alone. Researchers distinguish response-based distillation, which matches final outputs, from feature-based and relation-based variants that also match internal representations. In modern LLM practice, the term covers a spectrum from strict distribution-matching to the pragmatic recipe of generating high-quality outputs with a large teacher and using them as fine-tuning data for a small student.
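The soft-target objective can be sketched in a few lines. What follows is a minimal illustration in plain Python, not a production training loss: the function names, the temperature of 2.0, and the mixing weight alpha of 0.5 are assumptions chosen for the example. It blends the KL divergence between temperature-softened teacher and student distributions with ordinary cross-entropy on the hard label, and scales the soft term by the squared temperature as the 2015 paper suggests.

```python
import math

def softmax(logits, temperature=1.0):
    """Turn logits into probabilities; a higher temperature flattens them."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same classes."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def distillation_loss(student_logits, teacher_logits, label,
                      temperature=2.0, alpha=0.5):
    """Blend a soft-target term with hard-label cross-entropy.

    temperature and alpha are illustrative defaults, not recommended values.
    """
    soft_teacher = softmax(teacher_logits, temperature)
    soft_student = softmax(student_logits, temperature)
    # The T^2 factor keeps the soft term's gradient scale comparable
    # to the hard term's as the temperature changes.
    soft_term = kl_divergence(soft_teacher, soft_student) * temperature ** 2
    # Standard cross-entropy against the one-hot label, at temperature 1.
    hard_term = -math.log(softmax(student_logits)[label])
    return alpha * soft_term + (1 - alpha) * hard_term

teacher = [4.0, 3.0, 0.5]  # hypothetical logits: class 1 is a near-miss
print(distillation_loss([3.9, 3.1, 0.4], teacher, label=0))  # student agrees
print(distillation_loss([0.5, 3.0, 4.0], teacher, label=0))  # student disagrees
```

A student whose logits track the teacher's gets a lower loss than one that merely avoids the wrong label, which is how the near-miss structure reaches the student.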

Why it matters for sovereign AI

Distillation is a load-bearing wall of the local-AI stack. Frontier-scale models are trained and served in data centers, but distilled descendants of that capability run on consumer hardware, and that difference is the difference between renting intelligence and owning it. A distilled open-weight model on your own machine keeps prompts, outputs, and private data under your control, works offline, and cannot be repriced, rate-limited, or retired by a vendor. It is the same sovereignty argument that favors running your own node over trusting someone else's. Distilled models also serve as draft models in speculative decoding, where a small student proposes tokens that its larger relative verifies, accelerating local inference. One caveat belongs in any honest account: distilling from a proprietary model's outputs may conflict with that provider's terms of use, so the provenance of teacher data matters, and the cleanest lineages are open-weight teachers with permissive licenses.
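The draft-and-verify loop behind speculative decoding can be illustrated with toy stand-ins for the two models. This sketch makes several assumptions: target_next and draft_next are hypothetical arithmetic functions rather than real language models, decoding is greedy (systems that sample use a probabilistic accept/reject rule instead), and the verification loop calls the target once per token where a real implementation would score every drafted position in a single forward pass.

```python
def target_next(tokens):
    """Toy 'large' model: the token it would pick next (hypothetical rule)."""
    return (tokens[-1] * 3 + 1) % 7

def draft_next(tokens):
    """Toy 'distilled' model: agrees with the target except on one token."""
    guess = (tokens[-1] * 3 + 1) % 7
    return 0 if guess == 5 else guess  # deliberately wrong now and then

def speculative_generate(prompt, num_new, draft_len=4):
    """Greedy speculative decoding; returns (tokens, target verification passes)."""
    tokens = list(prompt)
    goal = len(prompt) + num_new
    target_passes = 0
    while len(tokens) < goal:
        # 1. The cheap draft model proposes a short run of tokens.
        proposal = []
        for _ in range(draft_len):
            proposal.append(draft_next(tokens + proposal))
        # 2. The target checks the run (one batched pass in a real system).
        target_passes += 1
        for guess in proposal:
            expected = target_next(tokens)
            tokens.append(expected)  # always keep the target's own choice
            if guess != expected or len(tokens) == goal:
                break  # first mismatch discards the rest of the draft
        else:
            # Every draft token matched, so the same pass yields a bonus token.
            tokens.append(target_next(tokens))
    return tokens, target_passes

tokens, passes = speculative_generate([1], num_new=12)
print(tokens, passes)
```

Because only the target's choices are ever appended, the output is identical to what the target would produce alone; the draft model changes how many target passes are needed, not the result.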

Limits and neighbors

Distillation is lossy. The student inherits the teacher's broad competence but loses depth at the edges: long-tail knowledge, subtle reasoning, and robustness on unusual inputs degrade first, and no amount of clever training makes a small model a free replica of a large one. It also inherits the teacher's flaws, biases and blind spots included. In the efficiency toolkit it sits beside quantization, which shrinks the numbers inside a model rather than the model itself, and upstream of preference tuning such as Direct Preference Optimization, which can then shape the distilled student's behavior. Together they form the pipeline by which capability trained at industrial scale ends up answering questions on hardware you can unplug.
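To make the contrast with quantization concrete, here is a minimal sketch of symmetric 8-bit quantization in plain Python. It is an illustration under simplifying assumptions: one scale for the whole list of weights and a hypothetical handful of values, whereas practical quantizers typically use per-channel or per-group scales and lower bit widths. The number of weights is unchanged; only the precision of each one shrinks.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale."""
    # Fall back to a scale of 1.0 if every weight is zero.
    scale = max(abs(w) for w in weights) / 127 or 1.0
    return [round(w / scale) for w in weights], scale

def dequantize(codes, scale):
    """Recover approximate floats from the integer codes."""
    return [c * scale for c in codes]

weights = [0.82, -0.31, 0.05, -1.27, 0.64]  # hypothetical layer weights
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)

# Same number of parameters, each stored in fewer bits,
# with a rounding error of at most half a quantization step.
print(codes)
print(max(abs(w - r) for w, r in zip(weights, restored)))
```

Distillation changes how many parameters there are, while quantization changes how many bits each one takes, which is why the two are commonly stacked.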
