Distillation
Distillation is a training technique where a smaller "student" model learns to replicate the behavior of a larger "teacher" model, typically by training on the teacher's outputs rather than (or in addition to) original labeled data.
Details
The goal is to transfer the teacher's capabilities into a model that is cheaper and faster to run at inference time, at the cost of some capability loss. Distillation can target the teacher's output probabilities (logprobs / soft labels), its intermediate representations, or simply its generated text (synthetic data).
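To make the soft-label case concrete, here is a minimal sketch, assuming PyTorch, of the classic distillation objective: a temperature-softened KL term against the teacher's output distribution, blended with ordinary cross-entropy on the hard labels. The tensor shapes and hyperparameters are illustrative, not prescriptive.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    """Blend a soft-label loss (match the teacher's softened distribution)
    with the usual hard-label cross-entropy loss."""
    # Soften both distributions with the temperature, then compare with KL.
    soft_targets = F.softmax(teacher_logits / temperature, dim=-1)
    log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    soft_loss = F.kl_div(log_probs, soft_targets,
                         reduction="batchmean") * temperature ** 2
    # Standard supervised loss on the original labels.
    hard_loss = F.cross_entropy(student_logits, labels)
    return alpha * soft_loss + (1 - alpha) * hard_loss

# Example usage with random tensors standing in for model outputs.
student_logits = torch.randn(4, 10, requires_grad=True)
teacher_logits = torch.randn(4, 10)
labels = torch.randint(0, 10, (4,))
loss = distillation_loss(student_logits, teacher_logits, labels)
loss.backward()
```

The temperature flattens both distributions so the student also learns from the teacher's relative preferences among wrong answers, not just its top prediction; the alpha weight trades that off against the labeled data.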
It is commonly used by model developers to create smaller variants of flagship models - including small language models - and by application developers to produce task-specific models from general-purpose ones. Open-weight models are frequently used as teachers: weight access is not strictly required for distillation (only the teacher's outputs are needed), but having the weights makes it practical to run the teacher locally for large-scale output generation.
Distillation often occupies a middle ground between pretraining (broad learning from raw data) and task-specific fine-tuning (narrow adaptation), but it can be applied at various stages of the training pipeline - including during pretraining itself (training a smaller model from scratch with teacher supervision) or after fine-tuning. The result is a model that retains much of the teacher's general ability while being significantly more efficient. Combined with model quantization, distillation is a primary technique for making large models practical in resource-constrained deployment settings.
Cascade distillation is a variant that alternates pruning (removing less-important parameters) and distillation to produce progressively smaller models from a single parent. Each pruned model is trained to mimic the original teacher, then serves as the starting point for the next smaller variant. This approach can produce a family of small language models at a fraction of the training cost of pretraining each size independently.
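A rough sketch of the cascade loop follows, assuming PyTorch. For brevity it uses unstructured magnitude pruning, which zeroes weights rather than physically shrinking the model (practical pipelines use structured pruning to remove whole neurons or layers); the helper names and toy models are hypothetical, and the loop structure is the point.

```python
import copy
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

def magnitude_prune(model, amount):
    """Zero out the smallest-magnitude weights in every linear layer."""
    for module in model.modules():
        if isinstance(module, nn.Linear):
            prune.l1_unstructured(module, name="weight", amount=amount)
            prune.remove(module, "weight")  # make the pruning permanent
    return model

def distill(student, teacher, data, steps=100, temperature=2.0, lr=1e-3):
    """Train the pruned student to match the original teacher's soft labels."""
    optimizer = torch.optim.Adam(student.parameters(), lr=lr)
    for _ in range(steps):
        x = data()
        with torch.no_grad():
            targets = torch.softmax(teacher(x) / temperature, dim=-1)
        log_probs = torch.log_softmax(student(x) / temperature, dim=-1)
        loss = nn.functional.kl_div(log_probs, targets, reduction="batchmean")
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return student

# Toy teacher and a stream of random inputs standing in for real data.
teacher = nn.Sequential(nn.Linear(32, 128), nn.ReLU(), nn.Linear(128, 10))
data = lambda: torch.randn(8, 32)

# Each round: copy the current model, prune it, then distill it against the
# *original* teacher; the result seeds the next, smaller round.
family, current = [], teacher
for _ in range(3):
    student = magnitude_prune(copy.deepcopy(current), amount=0.5)
    student = distill(student, teacher, data)
    family.append(student)
    current = student
```

Note that every round distills against the original teacher rather than the previous pruned model, which helps keep quality from degrading cumulatively across the cascade.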
Examples
- A model developer releasing an 8B-parameter distilled variant of a 70B-parameter flagship model.
- An application developer distilling a general-purpose model into a task-specific model that handles a narrow domain at lower cost.
- Training a student model on synthetic responses generated by a teacher model for a specific task (sketched below).
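As a sketch of the last example, assuming the Hugging Face transformers library, hypothetical model names, and (for brevity) that teacher and student share a tokenizer: the teacher generates synthetic responses to task prompts, and the student is fine-tuned on that text with ordinary next-token cross-entropy.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical model names; substitute whichever teacher/student pair you use.
teacher_name, student_name = "big-teacher-model", "small-student-model"
tokenizer = AutoTokenizer.from_pretrained(teacher_name)
teacher = AutoModelForCausalLM.from_pretrained(teacher_name)
student = AutoModelForCausalLM.from_pretrained(student_name)

prompts = ["Summarize the following report: ...",
           "Classify the sentiment of this review: ..."]

# Step 1: have the teacher generate synthetic responses for the task.
synthetic_texts = []
for prompt in prompts:
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        output_ids = teacher.generate(**inputs, max_new_tokens=128)
    synthetic_texts.append(tokenizer.decode(output_ids[0], skip_special_tokens=True))

# Step 2: fine-tune the student on the teacher's text with standard
# language-modeling loss (labels = input_ids).
optimizer = torch.optim.AdamW(student.parameters(), lr=1e-5)
for text in synthetic_texts:
    batch = tokenizer(text, return_tensors="pt")
    loss = student(**batch, labels=batch["input_ids"]).loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Because only the teacher's generated text is needed here, this form of distillation works even when the teacher's logits or weights are not exposed.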
Synonyms
knowledge distillation, model distillation