KL divergence loss & NLL loss for MNIST

Hi all. I am testing my use of the KL divergence loss with the official MNIST example: https://github.com/pytorch/examples/tree/master/mnist.
Setup: PyTorch 0.4.1, epochs = 10, batch_size = 64.
Note that forward returns log-probabilities: "return F.log_softmax(x, dim=1)".
My one_hot code (used for the KL divergence loss):
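import torch

def one_hot(target):
    batch_size = target.size(0)
    nb_digits = 10
    # Start from zeros; torch.FloatTensor(batch_size, nb_digits) is uninitialized memory
    y_onehot = torch.zeros(batch_size, nb_digits)
    # Write a 1 into each row at the column given by that row's label
    y_onehot.scatter_(1, torch.unsqueeze(target.cpu(), 1), 1)
    return y_onehot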

I tested six variants of the loss function in train(args, model, device, train_loader, optimizer, epoch):
1.1 loss = F.nll_loss(output, target) # the official default loss; log_softmax + nll_loss = cross-entropy loss
1.2 loss = F.nll_loss(output, target, reduction='sum')/batch_size
1.3 loss = F.nll_loss(output, target, reduction='sum')
1.1 and 1.2 give the same result, 98% test accuracy. But 1.3 does not converge (loss > 140) and reaches only 11% test accuracy. The immediate reason is that the loss is not averaged over the samples, but what is the underlying cause?

1.4 loss = F.kl_div(output, F.softmax(one_hot(target).to(device), dim=1))
1.5 loss = F.kl_div(output, F.softmax(one_hot(target).to(device), dim=1), reduction='sum')/batch_size
1.6 loss = F.kl_div(output, F.softmax(one_hot(target).to(device), dim=1), reduction='sum')
The test accuracies for 1.4, 1.5 and 1.6 are 60%, 93% and 98%, respectively. For a fixed target distribution, minimizing cross-entropy is equivalent to minimizing the KL divergence. 1.2 and 1.5 look analogous, yet 1.5 scores lower (93%). In addition, 1.6, which is not averaged over the samples (something I would have guessed to be unreasonable), matches 1.2 at 98%. In fact, 1.2 needs 5 epochs to reach 98% while 1.6 needs only 2. Which one of 1.4, 1.5 and 1.6 is the "true" KL loss, i.e. the one equivalent to the cross-entropy commonly used for image classification?

Thanks in advance!

If you compare the gradients of 1.2 and 1.3, you will find that 1.3's gradient is 1.2's gradient multiplied by batch_size, which in this case is 64. Put another way, 1.3 is equivalent to training with 1.2 but with the learning rate multiplied by batch_size = 64. That is a very large effective learning rate and is probably what breaks the optimization. If you want to use 1.3, I suggest dividing the learning rate by 64.
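The gradient relationship can be checked directly. Below is a minimal sketch, not the MNIST example itself: the linear model, the random inputs and the helper grad_of are stand-ins made up for illustration, and it assumes a recent PyTorch release where the averaging mode is spelled reduction='mean' (0.4.1 called it 'elementwise_mean').

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
batch_size = 64

# Hypothetical stand-in for the MNIST network: one linear layer, 20 features -> 10 classes.
model = torch.nn.Linear(20, 10)
x = torch.randn(batch_size, 20)
target = torch.randint(0, 10, (batch_size,))

def grad_of(reduction):
    """Return the weight gradient of nll_loss under the given reduction."""
    model.zero_grad()
    output = F.log_softmax(model(x), dim=1)
    F.nll_loss(output, target, reduction=reduction).backward()
    return model.weight.grad.clone()

grad_mean = grad_of('mean')  # corresponds to cases 1.1 and 1.2
grad_sum = grad_of('sum')    # corresponds to case 1.3

# The summed loss yields a gradient batch_size times larger than the averaged one.
ratio_ok = torch.allclose(grad_sum, batch_size * grad_mean, rtol=1e-4, atol=1e-5)
print(ratio_ok)
```

With a plain gradient-descent update, scaling the gradient by 64 has the same effect as scaling the learning rate by 64, which is why dividing the learning rate by batch_size should bring 1.3 back in line with 1.2.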

Yes, you are right; that solved it. Thanks a lot.
For 1.4 to 1.6, I made a silly mistake: one_hot(target) is already a class-probability distribution, so it does not need F.softmax.
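A small check of the corrected setup, as a sketch rather than the actual training code: the logits and labels are made-up values, and it assumes a recent PyTorch release that provides F.one_hot and reduction='batchmean' (neither is in 0.4.1). With a raw one-hot target, the KL divergence summed over the batch and divided by batch_size should match the averaged NLL loss, because a one-hot distribution has zero entropy.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
batch_size, nb_digits = 4, 10

logits = torch.randn(batch_size, nb_digits)  # hypothetical network outputs
target = torch.tensor([3, 1, 7, 0])          # hypothetical labels
output = F.log_softmax(logits, dim=1)        # same form as the model's forward

y_onehot = F.one_hot(target, num_classes=nb_digits).float()

# Cross-entropy as in case 1.1 (averaged over the batch).
nll = F.nll_loss(output, target)

# KL divergence against the raw one-hot target, summed and divided by batch size (case 1.5 without softmax).
kl = F.kl_div(output, y_onehot, reduction='sum') / batch_size

# The same per-sample average using the built-in reduction.
kl_batchmean = F.kl_div(output, y_onehot, reduction='batchmean')

# The mistaken target: softmax flattens the one-hot vector into a nearly uniform distribution.
softened = F.softmax(y_onehot, dim=1)

print(nll.item(), kl.item(), kl_batchmean.item())
print(softened[0])
```

After the softmax, the true class keeps only about 0.23 of the probability mass and every other class gets about 0.085, so the network was being trained toward a heavily smoothed target rather than the actual label.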