Hi all. I'm trying to test my use of the KL-divergence loss with the aid of the official MNIST example: https://github.com/pytorch/examples/tree/master/mnist.
PyTorch 0.4.1, epochs = 10, batch_size = 64.
Note that the model's forward returns F.log_softmax(x, dim=1).
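For context, here is a minimal sketch of such a network (a placeholder for illustration, not the exact file from the repo; the only detail that matters below is that forward ends in log_softmax, i.e. the output is already in log space):

import torch.nn as nn
import torch.nn.functional as F

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        # layer sizes are placeholders for illustration
        self.fc1 = nn.Linear(784, 50)
        self.fc2 = nn.Linear(50, 10)

    def forward(self, x):
        x = F.relu(self.fc1(x.view(-1, 784)))
        x = self.fc2(x)
        # log-probabilities: nll_loss and kl_div both expect log-space input
        return F.log_softmax(x, dim=1)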
My one_hot code (used for the KL-divergence loss):

import torch

def one_hot(target):
    # target: LongTensor of shape (batch_size,) holding class indices 0..9
    batch_size = target.size(0)
    nb_digits = 10
    y_onehot = torch.zeros(batch_size, nb_digits)
    # place a 1 at each sample's class index
    y_onehot.scatter_(1, target.cpu().unsqueeze(1), 1)
    return y_onehot
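A quick sanity check with made-up targets:

target = torch.tensor([3, 0])
print(one_hot(target))
# tensor([[0., 0., 0., 1., 0., 0., 0., 0., 0., 0.],
#         [1., 0., 0., 0., 0., 0., 0., 0., 0., 0.]])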
I tested 6 variants of the loss function in train(args, model, device, train_loader, optimizer, epoch):
1.1 loss = F.nll_loss(output, target)  # the official default loss; log_softmax + nll_loss = cross-entropy loss
1.2 loss = F.nll_loss(output, target, reduction='sum') / batch_size
1.3 loss = F.nll_loss(output, target, reduction='sum')
1.1 and 1.2 give the same result, 98%. But 1.3 does not converge (loss > 140), and its test accuracy is 11%. The direct reason is that the loss is not averaged over the samples, but what is the essential cause?
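For reference, the two reductions differ only by the constant factor batch_size, so the summed loss yields gradients batch_size times larger than the averaged one. A minimal check with made-up values:

import torch
import torch.nn.functional as F

log_probs = F.log_softmax(torch.randn(64, 10), dim=1)   # fake model output
target = torch.randint(0, 10, (64,), dtype=torch.long)  # fake labels

mean_loss = F.nll_loss(log_probs, target)                  # as in 1.1
sum_loss = F.nll_loss(log_probs, target, reduction='sum')  # as in 1.3
print(torch.allclose(mean_loss * 64, sum_loss))  # True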
1.4 loss = F.kl_div(output, F.softmax(one_hot(target).to(device), dim=1))
1.5 loss = F.kl_div(output, F.softmax(one_hot(target).to(device), dim=1), reduction='sum') / batch_size
1.6 loss = F.kl_div(output, F.softmax(one_hot(target).to(device), dim=1), reduction='sum')
The test accuracies for 1.4, 1.5, and 1.6 are 60%, 93%, and 98%, respectively. We all know that minimizing cross-entropy is equivalent to minimizing the KL divergence. 1.2 and 1.5 look analogous, yet 1.5 gives a lower result (93%). In addition, 1.6, which does not average over the samples (which I would guess is unreasonable), gives the same result as 1.2, both 98%; in fact, 1.2 needs 5 epochs to reach 98% while 1.6 needs only 2. So it is hard to tell which of 1.4, 1.5, 1.6 is the "true" KL loss, i.e. the one equivalent to the cross-entropy commonly used for image classification (see the sketch of the equivalence below).
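For completeness, here is a minimal sketch of that equivalence, assuming the raw one-hot vector is used directly as the target distribution (note this differs from 1.4-1.6, which push the one-hot through an extra softmax): with a one-hot target p, KL(p || q) = -log q[class], which is exactly the negative log-likelihood, so the two summed losses should match:

import torch
import torch.nn.functional as F

log_probs = F.log_softmax(torch.randn(4, 10), dim=1)  # fake log-probabilities
target = torch.tensor([3, 0, 7, 1])                   # fake labels

p = one_hot(target)  # one-hot target distribution, no extra softmax

kl = F.kl_div(log_probs, p, reduction='sum')
nll = F.nll_loss(log_probs, target, reduction='sum')
print(torch.allclose(kl, nll))  # True: for one-hot targets, KL == cross-entropy == NLL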
Thanks in advance!