Definition
Knowledge distillation is a model-compression technique that transfers the behaviour of a large, capable teacher model into a smaller, cheaper student model. Instead of learning only from hard labels (the single correct answer), the student is trained to reproduce the teacher's full output distribution, known as soft labels or soft targets. Those soft targets carry richer information about how the teacher relates classes or tokens to one another, so the student often reaches higher accuracy than the same model would if trained on hard labels alone.
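To make the difference between hard and soft labels concrete, here is a minimal sketch in plain Python. The three classes, the teacher logits, and the temperature parameter are illustrative assumptions, not values from any particular model. Temperature scaling is a common way to soften a distribution, but it is not described in the text above.

```python
import math

def softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities; a higher temperature flattens them."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical teacher scores for the classes (cat, dog, truck).
teacher_logits = [4.0, 3.0, -2.0]

# A hard label only says "the answer is cat".
hard_label = [1.0, 0.0, 0.0]

# Soft targets also say "dog is a plausible mistake, truck is not".
soft_targets = softmax(teacher_logits)

# Assumed temperature of 4.0: the same ranking, spread more evenly.
softer_targets = softmax(teacher_logits, temperature=4.0)
```

The hard label treats dog and truck as equally wrong. The soft targets keep the teacher's view that dog is far more likely than truck, and that extra signal is what the student learns from.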
Why it matters for self-hosters
For a sovereign Bitcoiner running models on local hardware, distillation is a large part of what makes a 7-billion-parameter model worth running. A distilled small model can capture much of a frontier model's competence while fitting in the VRAM of a single consumer GPU: no API, no cloud account, no telemetry leaving your network. Many of the popular small open-weight models you can run locally were produced or refined with distillation as part of the pipeline.
How the training works
The student minimises a loss that measures the gap between its predictions and the teacher's soft targets, commonly using KL divergence. Variants also align intermediate feature representations (feature-based distillation) or the relationships between examples (relation-based distillation), not just the final outputs. The teacher stays frozen; only the student learns.
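The output-matching loss described above can be sketched in plain Python as a KL divergence between the teacher's and the student's softened distributions. This is a simplified illustration under stated assumptions: the function names, the example logits, and the temperature of 2.0 are hypothetical, and a real pipeline would compute this over batches inside a training framework and update only the student's weights.

```python
import math

def softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities; a higher temperature flattens them."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same classes."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Gap between the student's predictions and the teacher's soft targets."""
    p_teacher = softmax(teacher_logits, temperature)  # frozen teacher: a fixed target
    q_student = softmax(student_logits, temperature)  # the only side that would be trained
    return kl_divergence(p_teacher, q_student)

# Hypothetical logits for a single three-class example.
teacher_logits = [4.0, 3.0, -2.0]

loss_far = distillation_loss([0.0, 0.0, 0.0], teacher_logits)    # untrained, uniform student
loss_near = distillation_loss([3.5, 2.8, -1.5], teacher_logits)  # student close to the teacher
loss_same = distillation_loss(teacher_logits, teacher_logits)    # perfect match
```

The loss shrinks as the student's distribution approaches the teacher's and reaches zero when the two match, which is the quantity training drives down. Feature-based and relation-based variants would add further terms that this sketch does not cover.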
Distillation is closely related to other shrink-to-fit techniques you will encounter when running models on your own metal. See quantization for reducing numerical precision, and parameter count for what model size actually measures.
