Knowledge distillation: which loss to use?

For knowledge distillation (KD), a quick search turned up many different variants of which loss is used, as well as other variations.



On KL vs. CE: Yes, I know the two are the same up to an additive constant (if you treat the teacher probabilities as constant), namely the teacher's entropy. So it should not really make a difference for the gradients. But still, is there any reason to choose one over the other?
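To make the "same up to a constant" point concrete, here is a minimal sketch in plain Python (no framework assumed). The logits and the helper names (`softmax`, `cross_entropy`, `kl_divergence`, `entropy`) are made up for illustration. It checks numerically that CE(teacher, student) minus KL(teacher || student) equals the teacher entropy, which does not depend on the student.

```python
import math

def softmax(logits):
    # Subtract the max for numerical stability before exponentiating.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(p, q):
    # CE(p, q) = -sum_i p_i * log(q_i)
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q))

def kl_divergence(p, q):
    # KL(p || q) = sum_i p_i * log(p_i / q_i)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def entropy(p):
    # H(p) = -sum_i p_i * log(p_i)
    return -sum(pi * math.log(pi) for pi in p)

# Hypothetical logits, chosen only for illustration.
teacher = softmax([2.0, 1.0, 0.1])
student = softmax([1.5, 0.3, 0.2])

# The gap between CE and KL is the teacher entropy, independent of the student.
gap = cross_entropy(teacher, student) - kl_divergence(teacher, student)
print(gap, entropy(teacher))
```

Since the gap does not involve the student, both losses give the same gradients with respect to the student parameters as long as the teacher is not trained. One practical difference is the reported value: KL reaches zero when the student matches the teacher exactly, while CE bottoms out at the teacher entropy.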

In all cases, the KD loss is used together with the normal loss (e.g. cross entropy to the ground truth targets), with some loss scaling or weighting between the two.

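Often, I see an additional temperature-dependent factor on the KD loss. The usual form seems to be multiplying by temperature^2, since the gradients of the softened KD term scale as 1/temperature^2.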
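As a rough sketch of how these pieces are commonly combined, again in plain Python: a hard-label cross entropy at temperature 1 plus a KL term on temperature-softened distributions. The function name `distillation_loss`, the parameter `kd_weight`, the default values, and the `(1 - kd_weight)` / `kd_weight` weighting are all assumptions for illustration, not a reference implementation. The `temperature ** 2` scaling follows the convention described by Hinton et al. (2015); other code bases may scale or weight the terms differently.

```python
import math

def softmax(logits, temperature=1.0):
    # Divide logits by the temperature, then apply a numerically stable softmax.
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, target_index,
                      temperature=2.0, kd_weight=0.5):
    # Normal loss: cross entropy to the ground truth target, at temperature 1.
    student_probs = softmax(student_logits)
    ce_hard = -math.log(student_probs[target_index])

    # KD loss: KL(teacher || student) on temperature-softened distributions.
    teacher_soft = softmax(teacher_logits, temperature)
    student_soft = softmax(student_logits, temperature)
    kd = sum(p * math.log(p / q) for p, q in zip(teacher_soft, student_soft))

    # Assumed weighting scheme; the KD term is scaled by temperature ** 2.
    return (1.0 - kd_weight) * ce_hard + kd_weight * temperature ** 2 * kd

# Hypothetical logits for a 3-class problem where class 0 is the true label.
loss = distillation_loss([1.5, 0.3, 0.2], [2.0, 1.0, 0.1], target_index=0)
print(loss)
```

Replacing the KL term with a cross entropy against `teacher_soft` would shift the loss value by the teacher entropy at that temperature but leave the student gradients unchanged.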

Some related discussion: