My code is below. When alpha is set to 0 in the first function, the KL term is multiplied by zero, so I expect training to behave the same as with the second function. But I get completely different results: setting alpha to 0 leads to wrong results. What am I missing?
def loss_fn_kd(outputs, labels, teacher_outputs, alpha, T):
    """
    Compute the knowledge-distillation (KD) loss given student outputs,
    hard labels, and teacher outputs.
    Hyperparameters: temperature T and interpolation weight alpha.
    """
    # Soft-target term: KL divergence between the softened student and
    # teacher distributions, scaled by T*T to keep gradient magnitudes
    # comparable across temperatures. (size_average is deprecated;
    # reduction='sum' is the current equivalent of size_average=False.)
    loss1 = nn.KLDivLoss(reduction='sum')(F.log_softmax(outputs / T, dim=1),
                                          F.softmax(teacher_outputs / T, dim=1)) * (alpha * T * T)
    # Hard-target term: ordinary cross-entropy against the ground-truth labels.
    loss2 = F.cross_entropy(outputs, labels, reduction='sum') * (1. - alpha)
    KD_loss = loss1 + loss2
    return KD_loss / outputs.size(0)  # average over the batch
def loss_fn_kd(outputs, labels, teacher_outputs, alpha, T):
    """
    Cross-entropy-only variant: teacher_outputs and T are unused here,
    kept only so the function is a drop-in replacement for the one above.
    """
    KD_loss = F.cross_entropy(outputs, labels, reduction='sum') * (1. - alpha)
    return KD_loss / outputs.size(0)  # average over the batch
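For what it's worth, the two loss functions are numerically identical when alpha is 0, which you can verify on toy tensors; if that check passes for you, the discrepancy likely comes from somewhere else in the training loop rather than from the loss itself. A minimal sanity check (the tensor shapes and seed are arbitrary, chosen just for illustration):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def loss_fn_kd(outputs, labels, teacher_outputs, alpha, T):
    # Same KD loss as in the question, with the deprecated size_average
    # argument replaced by reduction='sum'.
    soft = nn.KLDivLoss(reduction='sum')(F.log_softmax(outputs / T, dim=1),
                                         F.softmax(teacher_outputs / T, dim=1)) * (alpha * T * T)
    hard = F.cross_entropy(outputs, labels, reduction='sum') * (1. - alpha)
    return (soft + hard) / outputs.size(0)

torch.manual_seed(0)
outputs = torch.randn(4, 10, requires_grad=True)  # toy student logits
teacher_outputs = torch.randn(4, 10)              # toy teacher logits
labels = torch.randint(0, 10, (4,))

kd = loss_fn_kd(outputs, labels, teacher_outputs, alpha=0.0, T=4.0)
ce = F.cross_entropy(outputs, labels, reduction='sum') / outputs.size(0)
print(torch.allclose(kd, ce))  # prints True: the losses match when alpha == 0
```

Since the forward values agree and the KL term is scaled by exactly zero, the backward gradients agree as well, so the loss alone cannot explain diverging training curves.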