Knowledge Distillation: What the Teacher Knows That the Labels Don't

Distillation is not model compression. It is knowledge transfer - and the signal lives in the soft probability distribution that hard labels discard.

2026-04-07 · 8 min read · distillation, llm, training, deployment, efficiency, on-premise

A training label tells the model what the correct answer is. It says nothing about how wrong the wrong answers are, or how wrong answers relate to each other.

That missing information is not trivial. A model that has learned that "4" and "9" are more confusable than "4" and "7" knows something useful about digit structure. A model trained only on one-hot labels - this image is a 4, not any other digit - never sees that relationship. The signal is discarded before training begins.

Knowledge distillation is a technique built around recovering that signal. The mechanism: train a smaller student model not against ground truth labels, but against the output distribution of a larger teacher model. The soft probability distribution the teacher produces over all classes carries structure that hard labels throw away. The student learns from that structure.

The Mechanics: Soft Labels and Dark Knowledge

When a well-trained classifier produces a prediction, it outputs a probability distribution over all classes. On a correctly classified example, the correct class might get 0.85. The remaining 0.15 is distributed across wrong classes - but not uniformly. Closely related classes get more probability mass than distant ones.

Geoffrey Hinton, who formalized distillation in a 2015 paper with Oriol Vinyals and Jeff Dean, called this the "dark knowledge" in the model: information about the similarity structure of the problem that is encoded in the output distribution but invisible in the hard labels.

Hard-label training discards this. You collapse the teacher's distribution to argmax, discard the probabilities, and train the student on a one-hot vector. You have thrown away the structure.

Distillation training keeps the teacher's full probability distribution and uses it as the training target for the student. The student minimizes the divergence between its output distribution and the teacher's, rather than between its output and the hard label.
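To make the difference concrete, here is a minimal sketch in plain Python. The three-class probabilities and the two students are hypothetical numbers chosen for illustration, not outputs from a real model. Both students put the same mass on the correct class, so a hard label cannot tell them apart; the divergence from the teacher can.

```python
import math

def cross_entropy(target, probs, eps=1e-12):
    """Cross-entropy of predicted probs against a target distribution."""
    return -sum(t * math.log(q + eps) for t, q in zip(target, probs))

def kl_divergence(teacher_probs, student_probs, eps=1e-12):
    """KL(teacher || student) over the same set of classes."""
    return sum(
        p * (math.log(p + eps) - math.log(q + eps))
        for p, q in zip(teacher_probs, student_probs)
        if p > 0.0
    )

# Hypothetical outputs for an image of a "4", over the classes [4, 9, 7].
teacher = [0.85, 0.12, 0.03]     # "9" is the more plausible mistake
hard_label = [1.0, 0.0, 0.0]     # what argmax plus one-hot keeps

student_a = [0.85, 0.12, 0.03]   # preserves the 9-vs-7 structure
student_b = [0.85, 0.03, 0.12]   # swaps it

# The hard-label loss only looks at the correct class: identical for both.
hard_a = cross_entropy(hard_label, student_a)
hard_b = cross_entropy(hard_label, student_b)

# The distillation loss sees the whole distribution: student_b is penalized.
soft_a = kl_divergence(teacher, student_a)
soft_b = kl_divergence(teacher, student_b)
```

Here `hard_a` and `hard_b` are equal, while `soft_a` is zero and `soft_b` is clearly positive. That gap is the signal the hard label never carries.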

Temperature Scaling

There is a practical problem: on examples the teacher classifies confidently, the soft distribution is still sharply peaked - 0.97 on the correct class, with the remaining 0.03 spread across everything else. The ratios between the low-probability classes are meaningful, but they are numerically small enough that gradient updates are dominated by the correct class signal.

Temperature scaling addresses this. Before computing the softmax, you divide the logits by a temperature parameter T. At T=1, you get the standard softmax. At T=4, the distribution flattens: confident predictions become less peaked, and the relative ordering of the low-probability classes becomes more prominent in the gradient.

During distillation training, you apply the same temperature to both the teacher's logits (to generate the soft targets) and the student's logits (to compute the loss). After training, the student runs at T=1 for inference.

The practical effect: higher temperature makes the dark knowledge more accessible to the student. It amplifies the signal in the tails of the teacher's distribution. A temperature of 3–5 is common in practice; the right value is task-dependent and worth tuning.
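A small sketch of the flattening effect, in plain Python. The logits are made-up values for illustration; the function name `softmax_with_temperature` is my own, not a library call.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Softmax over logits divided by a temperature T (T=1 is the standard softmax)."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)                      # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical teacher logits for one confidently classified example.
logits = [8.0, 4.0, 3.0]

p_t1 = softmax_with_temperature(logits, temperature=1.0)
p_t4 = softmax_with_temperature(logits, temperature=4.0)
```

With these logits the top class holds roughly 0.98 of the mass at T=1 and roughly 0.60 at T=4. The ranking of the classes does not change; what changes is how much of the probability mass, and therefore how much of the gradient, sits in the tail.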

Three Paradigms

Distillation has evolved beyond the original response-based formulation. There are now three well-established paradigms with different tradeoffs.

Response-based distillation is the original: the student learns to match the teacher's output distribution. This is computationally simple - you only need the teacher's final outputs during training. The student does not need access to the teacher's weights or intermediate representations. This matters when the teacher is a black-box API or a proprietary model, provided the API exposes output probabilities or logits; if it returns only text, the student is limited to training on the teacher's hard outputs.

Feature-based distillation trains the student to match the teacher's intermediate representations - activations from hidden layers, not just the final output. This gives the student more signal: it is learning to represent the problem similarly to the teacher, not just to produce similar outputs. The tradeoff is that you need access to teacher internals, and aligning intermediate representations across architectures of different sizes requires careful design. Hint-based distillation and attention transfer are both variants of this approach.

Relation-based distillation trains the student to match relationships between examples as the teacher encodes them, rather than individual activations. If the teacher represents two examples as close in embedding space, the student should also. This is useful when the exact magnitude of activations matters less than the structure they encode. It is more architecturally flexible - the student and teacher can have substantially different internal dimensions.
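One way to sketch the relation-based idea is to compare pairwise distance matrices rather than the embeddings themselves. This is an illustrative loss in the spirit of distance-based relational objectives, not a specific published formulation; the embeddings and function names are hypothetical.

```python
import math

def pairwise_distances(embeddings):
    """Euclidean distance between every pair of examples in a batch."""
    n = len(embeddings)
    return [[math.dist(embeddings[i], embeddings[j]) for j in range(n)] for i in range(n)]

def normalize(matrix):
    """Divide by the mean off-diagonal distance so the overall scale drops out."""
    n = len(matrix)
    off_diagonal = [matrix[i][j] for i in range(n) for j in range(n) if i != j]
    mean = sum(off_diagonal) / len(off_diagonal)
    return [[d / mean for d in row] for row in matrix]

def relation_loss(teacher_embeddings, student_embeddings):
    """Mean squared difference between normalized distance matrices."""
    t = normalize(pairwise_distances(teacher_embeddings))
    s = normalize(pairwise_distances(student_embeddings))
    n = len(t)
    return sum((t[i][j] - s[i][j]) ** 2 for i in range(n) for j in range(n)) / (n * n)

# Teacher embeds in 4 dimensions, students in 2: the loss never compares them directly.
teacher = [[0.0, 0.0, 0.0, 0.0], [1.0, 0.0, 0.0, 0.0], [0.0, 5.0, 0.0, 0.0]]
student_same_structure = [[0.0, 0.0], [2.0, 0.0], [0.0, 10.0]]   # same geometry, different scale
student_other_structure = [[0.0, 0.0], [10.0, 0.0], [0.0, 2.0]]  # near and far examples swapped

loss_same = relation_loss(teacher, student_same_structure)
loss_other = relation_loss(teacher, student_other_structure)
```

Because only the batch-by-batch distance matrix is compared, the teacher and student widths never have to line up, which is where the architectural flexibility comes from.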

Distillation for Language Models

Applying distillation to LLMs introduces complications that do not arise in classification tasks.

In classification, the teacher output is a distribution over a fixed, small set of classes. In autoregressive language generation, the teacher output at each step is a distribution over a vocabulary that typically runs from about 30,000 to well over 100,000 tokens. The distillation signal is high-dimensional and the sequence is long. Training the student to match this distribution token-by-token is expensive but feasible.

Logit distillation is the direct application: at each token position, the student minimizes KL divergence between its output distribution and the teacher's. DistilBERT uses a soft-target loss of this kind over the teacher's masked-language-model outputs (alongside a masked-language-modeling loss and a cosine loss on hidden states), and the same idea carries over to autoregressive, GPT-class students. It works well when the student and teacher share a vocabulary and when the teacher's outputs are well-calibrated.

Attention distillation trains the student to match the teacher's attention patterns across layers. This is a form of feature-based distillation - the student is learning to "look at" the same parts of the input as the teacher. Useful when the task depends heavily on long-range dependencies.

Hidden state distillation trains the student to match the teacher's intermediate hidden states, often with a learned projection to handle the dimension mismatch between student and teacher layers. This is the most information-rich approach and also the most constrained architecturally.
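A minimal sketch of the projection idea, in plain Python. The weight matrix here is fixed by hand for illustration; in a real pipeline it would be a trainable layer learned jointly with the student. All names and numbers are assumptions.

```python
def project(student_hidden, weight):
    """Map a student hidden state into the teacher's dimension.

    weight has one row per teacher dimension and one column per student dimension.
    """
    return [sum(w * x for w, x in zip(row, student_hidden)) for row in weight]

def hidden_state_loss(student_hidden, teacher_hidden, weight):
    """Mean squared error between the projected student state and the teacher state."""
    projected = project(student_hidden, weight)
    return sum((p - t) ** 2 for p, t in zip(projected, teacher_hidden)) / len(teacher_hidden)

# Hypothetical sizes: student width 2, teacher width 3.
weight = [
    [1.0, 0.0],
    [0.0, 1.0],
    [1.0, 1.0],
]
student_hidden = [0.5, -1.0]

loss_matched = hidden_state_loss(student_hidden, [0.5, -1.0, -0.5], weight)
loss_mismatched = hidden_state_loss(student_hidden, [1.0, -1.0, 0.0], weight)
```

The architectural constraint shows up in what the sketch leaves out: you still have to decide which student layer maps to which teacher layer, and every mapped pair needs its own projection.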

In practice, most production LLM distillation pipelines combine logit distillation with a cross-entropy loss on the ground truth labels. The combined loss keeps the student grounded in correct answers while giving it access to the teacher's richer signal.
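A sketch of that combined objective, in plain Python. The weighting `alpha`, the default temperature, and the function names are assumptions for illustration. The soft term is multiplied by T squared, following the observation in the original distillation paper that soft-target gradients shrink by roughly 1/T² as the temperature rises.

```python
import math

def softmax(logits, temperature=1.0):
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def hard_label_loss(student_logits, label):
    """Cross-entropy against the ground-truth token id, at T=1."""
    return -math.log(softmax(student_logits)[label])

def soft_target_loss(student_logits, teacher_logits, temperature):
    """KL(teacher || student), both softened with the same temperature."""
    teacher = softmax(teacher_logits, temperature)
    student = softmax(student_logits, temperature)
    return sum(p * math.log(p / q) for p, q in zip(teacher, student) if p > 0.0)

def combined_loss(student_logits, teacher_logits, label, temperature=4.0, alpha=0.5):
    """alpha weights the hard-label term; (1 - alpha) weights the distillation term."""
    hard = hard_label_loss(student_logits, label)
    soft = soft_target_loss(student_logits, teacher_logits, temperature)
    return alpha * hard + (1.0 - alpha) * temperature ** 2 * soft

def sequence_loss(student_seq, teacher_seq, labels, temperature=4.0, alpha=0.5):
    """Average the per-token combined loss over all positions in a sequence."""
    per_token = [
        combined_loss(s, t, y, temperature, alpha)
        for s, t, y in zip(student_seq, teacher_seq, labels)
    ]
    return sum(per_token) / len(per_token)

# Hypothetical 3-token vocabulary, one position.
teacher_logits = [2.0, 0.5, -1.0]
untrained_student = [0.0, 0.0, 0.0]
loss_untrained = combined_loss(untrained_student, teacher_logits, label=0)
loss_matching = combined_loss(teacher_logits, teacher_logits, label=0)
```

When the student's logits match the teacher's, the soft term vanishes and only the weighted hard-label term remains. A real implementation would work on batched tensors and guard against underflow in the ratio `p / q`; this version is only meant to show how the two terms fit together.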

The Quality Ceiling

A distilled student should not be expected to beat its teacher on the teacher's tasks. Treat this as a ceiling, and plan around it. (Self-distillation studies have reported students that edge past their teachers, but that is not something to budget for.)

If your teacher model achieves 91% accuracy on your task, your distilled student will land at some fraction of that - typically 85–90% of the teacher's quality, depending on the compression ratio and how well the distillation was executed. If quality is measured as accuracy, that works out to roughly 77–82% for a 91% teacher. You can close the gap with better distillation design, more training data, and careful hyperparameter tuning. You should not plan on exceeding the teacher.

Where this ceiling matters: if the teacher is already underperforming on your task, distillation will not fix the problem. You need a better teacher first.

Where it does not matter: for many production tasks, a 90%-of-teacher model running at 10x lower latency and 5x lower cost is the right tradeoff. The teacher's additional quality may not translate into business value at the margin - especially for tasks where the output is a routing decision, a classification, or a structured extraction rather than open-ended generation.

Distillation vs Alternatives

When you're facing a deployment constraint - latency, compute, cost, on-premise requirement - distillation is one tool among several. The choice depends on what you have access to.

Quantization reduces the numerical precision of model weights (FP32 → INT8 → INT4). No teacher required, no additional training required for post-training quantization. The tradeoff is quality degradation that becomes steep below INT8 for most tasks, and the compressed model has the same architecture as the original.
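For intuition, here is a sketch of the simplest form of post-training quantization: symmetric, per-tensor INT8 with a single scale factor. Production toolchains typically use per-channel or group-wise scales and calibration data; the weights below are made up.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale

def dequantize(quantized, scale):
    """Recover approximate float weights from the integers and the scale."""
    return [q * scale for q in quantized]

# Hypothetical weights from one tensor.
weights = [0.31, -1.27, 0.002, 0.9]
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

The round-trip error per weight is bounded by half the scale, and the scale is set by the largest weight in the tensor. That is one reason outlier weights make low-bit quantization harder: a single large value stretches the grid for everything else.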

Pruning removes weights that contribute little to model outputs. Like quantization, it operates on the existing model without a teacher. Unstructured pruning can achieve high sparsity with limited quality loss; structured pruning (removing entire heads or layers) is more hardware-friendly but more aggressive.
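Unstructured magnitude pruning, the most basic variant, can be sketched in a few lines. The function name and the example weights are illustrative assumptions.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the given fraction of weights with the smallest absolute value."""
    drop_count = int(len(weights) * sparsity)
    if drop_count == 0:
        return list(weights)
    by_magnitude = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    dropped = set(by_magnitude[:drop_count])
    return [0.0 if i in dropped else w for i, w in enumerate(weights)]

# Hypothetical weights; prune half of them.
weights = [0.5, -0.01, 0.2, -0.9, 0.03, 0.0004]
pruned = magnitude_prune(weights, sparsity=0.5)
```

The zeros land wherever the small weights happen to be, which is why unstructured sparsity needs sparse-aware kernels to turn into real speedups. Structured pruning removes whole heads or layers instead, so the remaining model is simply a smaller dense model.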

Choosing a natively smaller model - Mistral 7B instead of Llama 70B from the start - is often underrated. If your task does not require the larger model's capabilities, you have not used a compression technique at all, and there is no quality ceiling from compression.

Distillation is the right choice when: you have access to a high-quality teacher, you have compute budget for distillation training, and your performance requirements make quantization or pruning alone insufficient. It produces students that are architecturally flexible - the student does not need to be a smaller version of the teacher's architecture - and it often achieves better quality at a given size than starting from a natively smaller model trained on hard labels.

The Regulated-Industry Case

For teams deploying AI in regulated environments - financial services, legal, insurance, healthcare - the deployment constraints are often not negotiable. Data residency requirements may prohibit sending data to a third-party API. Air-gapped environments may make cloud inference architecturally impossible. Latency requirements may make large model inference economically unviable at scale.

Distillation enables a path: train against a large, capable teacher using data you can access in a training environment, then deploy the smaller student in the constrained production environment. The student inherits the teacher's task-specific knowledge at a fraction of the inference cost.

The on-premise calculus is concrete. A 7B parameter student running on a single A10G can handle hundreds of requests per second for a document classification task. Serving the 70B teacher for the same workload would require an 8-GPU cluster at roughly 10x the infrastructure cost. If the task fits the student's quality envelope, the economics are not close.

The key prerequisite is that the task fits the student's quality envelope - which requires measuring the gap between teacher and student on your specific evaluation set, not on benchmark leaderboards. Distillation, like any ML technique, requires task-specific validation before production deployment.
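A sketch of what that measurement can look like for a classification task. The labels and predictions are invented for illustration, and `quality_gap` is a hypothetical helper, not part of any library.

```python
def accuracy(predictions, labels):
    """Fraction of positions where the prediction equals the reference."""
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)

def quality_gap(teacher_preds, student_preds, labels):
    """Compare teacher and student on the same task-specific evaluation set."""
    teacher_acc = accuracy(teacher_preds, labels)
    student_acc = accuracy(student_preds, labels)
    return {
        "teacher_accuracy": teacher_acc,
        "student_accuracy": student_acc,
        # Share of the teacher's accuracy that the student retains.
        "retention": student_acc / teacher_acc if teacher_acc else float("nan"),
        # How often the student makes the same call as the teacher, right or wrong.
        "agreement": accuracy(student_preds, teacher_preds),
    }

# Hypothetical 10-example evaluation set.
labels        = [0, 1, 1, 0, 2, 2, 1, 0, 2, 1]
teacher_preds = [0, 1, 1, 0, 2, 2, 1, 0, 2, 0]
student_preds = [0, 1, 1, 0, 2, 1, 1, 0, 2, 0]

report = quality_gap(teacher_preds, student_preds, labels)
```

Retention answers the question of how much of the teacher you kept. Agreement is worth tracking separately, because a student can match the teacher's accuracy while disagreeing with it on which examples it gets wrong.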
