For knowledge distillation (KD), a quick search turns up many different variants of the loss that is used, along with other variations.
E.g.:
- Cross entropy (CE), for example here:
- Knowledge Distillation Tutorial — PyTorch Tutorials 2.2.0+cu121 documentation
- Knowledge distillation - Wikipedia
- Distilling the Knowledge in a Neural Network, https://arxiv.org/pdf/1503.02531.pdf
- Kullback-Leibler (KL) divergence, for example here:
- Mean squared error (MSE) of the logits
- see comparison paper below
- “Distilling the Knowledge in a Neural Network” notes that in the high-temperature limit (with zero-mean logits), the soft-target gradient approaches the gradient of matching the logits directly, so it behaves like MSE on the logits
- Jensen–Shannon divergence maybe?
Comparisons:
- Comparing Kullback-Leibler Divergence and Mean Squared Error Loss in Knowledge Distillation, https://www.ijcai.org/proceedings/2021/0362.pdf
On KL vs CE: Yes, I know they are the same up to an additive constant, namely the teacher entropy, which is constant with respect to the student parameters, so the gradients (and thus the optimization) should be identical. But still, is there any reason to choose one over the other?
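To make the relation between the variants above concrete, here is a minimal sketch (the logits, shapes and temperature are made up for illustration) that computes the KL, soft-target CE and logit-MSE variants on the same teacher/student logits, and checks numerically that CE and KL differ only by the (constant) teacher entropy and so have identical gradients:

```python
import torch
import torch.nn.functional as F

# Made-up logits for a batch of 4 examples and 10 classes (illustrative only).
teacher_logits = torch.randn(4, 10)
student_logits = torch.randn(4, 10, requires_grad=True)
T = 2.0  # softening temperature

p = F.softmax(teacher_logits / T, dim=-1)          # teacher soft targets
log_q = F.log_softmax(student_logits / T, dim=-1)  # student log-probabilities

# Variant 1: KL divergence KL(p || q), mean over the batch.
kl = F.kl_div(log_q, p, reduction="batchmean")

# Variant 2: cross entropy to the soft targets, H(p, q).
ce = -(p * log_q).sum(dim=-1).mean()

# Variant 3: MSE directly on the (temperature-scaled) logits.
mse = F.mse_loss(student_logits / T, teacher_logits / T)
print(f"KL={kl.item():.4f}  CE={ce.item():.4f}  MSE={mse.item():.4f}")

# CE and KL differ only by the teacher entropy H(p), which does not depend
# on the student parameters, so the gradients w.r.t. the student are identical.
teacher_entropy = -(p * p.log()).sum(dim=-1).mean()
print(torch.allclose(ce, kl + teacher_entropy, atol=1e-6))  # True

grad_ce = torch.autograd.grad(ce, student_logits, retain_graph=True)[0]
grad_kl = torch.autograd.grad(kl, student_logits)[0]
print(torch.allclose(grad_ce, grad_kl, atol=1e-6))          # True
```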
In all cases, the KD loss is combined with the normal loss (e.g. cross entropy to the ground-truth targets), with some weighting between the two terms.
Often, I also see an additional factor of temperature^2 on the KD loss, apparently to compensate for the soft-target gradients scaling as 1/temperature^2 (as mentioned in the Hinton et al. paper).
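Putting those two pieces together, here is a minimal sketch of the combined objective (the function name and the defaults T=4.0 and alpha=0.9 are just illustrative, not taken from any particular reference):

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, targets, T=4.0, alpha=0.9):
    """Weighted sum of a soft-target distillation term and the usual hard-target CE."""
    # Softened distributions at temperature T.
    log_q = F.log_softmax(student_logits / T, dim=-1)
    p = F.softmax(teacher_logits / T, dim=-1)

    # Distillation term. The gradients of the softened term scale roughly as
    # 1/T^2, so multiplying by T^2 keeps its magnitude comparable to the
    # hard-target term when the temperature is changed.
    soft_loss = F.kl_div(log_q, p, reduction="batchmean") * (T * T)

    # Normal cross entropy to the ground-truth labels (at temperature 1).
    hard_loss = F.cross_entropy(student_logits, targets)

    return alpha * soft_loss + (1.0 - alpha) * hard_loss

# Usage with made-up shapes: batch of 8, 100 classes.
student_logits = torch.randn(8, 100)
teacher_logits = torch.randn(8, 100)
targets = torch.randint(0, 100, (8,))
print(kd_loss(student_logits, teacher_logits, targets))
```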
Some related discussion: