My code is below. When alpha is set to 0 in the first function, the KL term is multiplied by zero, so I expect training to behave the same as with the second function. But I get completely different results: setting alpha to 0 leads to wrong results. What am I missing?
def loss_fn_kd(outputs, labels, teacher_outputs, alpha, T):
    """
    Compute the knowledge-distillation (KD) loss given student outputs,
    hard labels, and teacher outputs.
    Hyperparameters: temperature T and interpolation weight alpha.
    """
    # Soft-target term: KL divergence between the softened student and
    # teacher distributions, scaled by T*T to keep gradient magnitudes
    # comparable across temperatures. (size_average is deprecated;
    # reduction='sum' is the current equivalent of size_average=False.)
    loss1 = nn.KLDivLoss(reduction='sum')(F.log_softmax(outputs / T, dim=1),
                                          F.softmax(teacher_outputs / T, dim=1)) * (alpha * T * T)
    # Hard-target term: ordinary cross-entropy against the ground-truth labels.
    loss2 = F.cross_entropy(outputs, labels, reduction='sum') * (1. - alpha)
    KD_loss = loss1 + loss2
    return KD_loss / outputs.size(0)  # average over the batch
def loss_fn_kd(outputs, labels, teacher_outputs, alpha, T):
    """
    Cross-entropy-only variant: teacher_outputs and T are unused here,
    kept only so the function is a drop-in replacement for the one above.
    """
    KD_loss = F.cross_entropy(outputs, labels, reduction='sum') * (1. - alpha)
    return KD_loss / outputs.size(0)  # average over the batch
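For what it's worth, the two loss functions are numerically identical when alpha is 0, which you can verify on toy tensors; if that check passes for you, the discrepancy likely comes from somewhere else in the training loop rather than from the loss itself. A minimal sanity check (the tensor shapes and seed are arbitrary, chosen just for illustration):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def loss_fn_kd(outputs, labels, teacher_outputs, alpha, T):
    # Same KD loss as in the question, with the deprecated size_average
    # argument replaced by reduction='sum'.
    soft = nn.KLDivLoss(reduction='sum')(F.log_softmax(outputs / T, dim=1),
                                         F.softmax(teacher_outputs / T, dim=1)) * (alpha * T * T)
    hard = F.cross_entropy(outputs, labels, reduction='sum') * (1. - alpha)
    return (soft + hard) / outputs.size(0)

torch.manual_seed(0)
outputs = torch.randn(4, 10, requires_grad=True)  # toy student logits
teacher_outputs = torch.randn(4, 10)              # toy teacher logits
labels = torch.randint(0, 10, (4,))

kd = loss_fn_kd(outputs, labels, teacher_outputs, alpha=0.0, T=4.0)
ce = F.cross_entropy(outputs, labels, reduction='sum') / outputs.size(0)
print(torch.allclose(kd, ce))  # prints True: the losses match when alpha == 0
```

Since the forward values agree and the KL term is scaled by exactly zero, the backward gradients agree as well, so the loss alone cannot explain diverging training curves.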