PyTorch equivalent of sparse softmax cross entropy with logits in TensorFlow

Is there a PyTorch equivalent of sparse_softmax_cross_entropy_with_logits from TensorFlow?

I found CrossEntropyLoss and BCEWithLogitsLoss, but neither seems to be what I want. I ran the same simple CNN architecture with the same optimization algorithm and settings; TensorFlow reaches 99% accuracy in no more than 10 epochs, but PyTorch converges to only 90% accuracy (even after 100 epochs). Another difference is that BCEWithLogitsLoss requires one-hot encoded labels, while CrossEntropyLoss accepts integer-valued labels.

If there is no such equivalent, is it possible to implement it manually?


Note that BCELoss and BCEWithLogitsLoss are for binary labels.

There’s NLLLoss, i.e., negative log-likelihood, https://pytorch.org/docs/stable/nn.html#nllloss (it operates on softmax output), and CrossEntropyLoss, which already combines the softmax and the negative log-likelihood for you:
https://pytorch.org/docs/stable/nn.html#crossentropyloss
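
In practice, nn.CrossEntropyLoss already behaves like TensorFlow's sparse_softmax_cross_entropy_with_logits in the sense that it takes raw logits and integer class indices (no one-hot encoding). A minimal sketch, where the tensor names and shapes are just for illustration:

import torch
import torch.nn as nn

# raw, unnormalized scores ("logits") for a batch of 4 samples and 10 classes
logits = torch.randn(4, 10)
# integer class indices, as with the TF "sparse" variant (no one-hot needed)
labels = torch.tensor([3, 0, 7, 1])

criterion = nn.CrossEntropyLoss()   # applies LogSoftmax + NLLLoss internally
loss = criterion(logits, labels)
print(loss)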

Also note that CrossEntropyLoss uses nn.LogSoftmax, not nn.Softmax – the log version is numerically more stable (I'm not sure how TensorFlow implements its negative log-likelihood cost function, i.e., whether it uses log softmax or softmax on the logits). You could combine nn.Softmax and nn.NLLLoss if you like and see if that replicates the TensorFlow outputs.

The convergence difference you mentioned can have many different reasons including the random seed for the weight initialization and the optimizer parameterization.
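
If you want to rule out initialization as a factor, you can fix the seeds before building the model. A minimal sketch, assuming you only need repeatable runs on a single machine (the seed values themselves are arbitrary):

import random
import numpy as np
import torch

# make weight initialization and data shuffling repeatable
random.seed(0)
np.random.seed(0)
torch.manual_seed(0)

# optional: deterministic cuDNN kernels when running on a GPU
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False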

import torch.nn.functional as F

# `input` holds the raw logits, `target` the integer class labels
loss = F.nll_loss(F.softmax(input, dim=1), target)

The disadvantage of using softmax and the NLL loss separately is that it's numerically less stable than using the derivative of the NLL loss with respect to the activation function directly.
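
To see the stability issue concretely: with extreme logits, the softmax probabilities underflow to zero, so taking their log afterwards produces -inf, whereas log_softmax stays finite. A small sketch (not from the original post, values chosen only to trigger the underflow):

import torch
import torch.nn.functional as F

x = torch.tensor([[1000.0, 0.0, -1000.0]])   # deliberately extreme logits

print(torch.log(F.softmax(x, dim=1)))   # small entries underflow to 0, log gives -inf
print(F.log_softmax(x, dim=1))          # stays finite: [0., -1000., -2000.]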

I am not sure about the "sparse" part, but it shouldn't affect the results.
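
For what it's worth, the "sparse" in sparse_softmax_cross_entropy_with_logits only refers to the label format: integer class indices instead of one-hot vectors. That is exactly the format nn.CrossEntropyLoss and nn.NLLLoss expect, so nothing extra is needed. If your labels happen to be one-hot already, you can recover the indices with argmax; a small sketch (not from the original post):

import torch

one_hot = torch.tensor([[0., 1., 0.],
                        [1., 0., 0.]])
# convert one-hot rows to integer class indices for CrossEntropyLoss / NLLLoss
labels = one_hot.argmax(dim=1)   # tensor([1, 0])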


One side note:
nn.NLLLoss should be used with nn.LogSoftmax, not nn.Softmax directly.
So basically a logit output with nn.CrossEntropyLoss or an nn.LogSoftmax output with nn.NLLLoss yields identical losses:

import torch
import torch.nn as nn

m = nn.LogSoftmax(dim=1)

criterion1 = nn.CrossEntropyLoss()   # expects raw logits
criterion2 = nn.NLLLoss()            # expects log-probabilities

x = torch.randn(1, 5)                              # raw logits for one sample, 5 classes
y = torch.empty(1, dtype=torch.long).random_(5)    # random integer class label

loss1 = criterion1(x, y)
loss2 = criterion2(m(x), y)
print(loss1)
print(loss2)
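
The same equivalence holds for the functional API, if you prefer it; a short sketch continuing with the x and y from the snippet above:

import torch.nn.functional as F

loss3 = F.cross_entropy(x, y)                     # same as criterion1(x, y)
loss4 = F.nll_loss(F.log_softmax(x, dim=1), y)    # same as criterion2(m(x), y)
print(loss3, loss4)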

@rasbt @ptrblck Thanks for the explanation. The code I am modifying is from the MNIST example on this website.

I also tried the example from the official PyTorch website; it is fast and converges to good performance. Even if I remove the dropout part and modify the neural network to match the TensorFlow example, I get better results (around 95%).

I will focus on the official example from PyTorch. Thanks for your explanations.


If helpful, I have a collection of implementations in Jupyter Notebooks where most of the multi-layer perceptrons and convnets are based on MNIST. For most, there’s a TensorFlow and a PyTorch implementation if you’d like to compare the two: https://github.com/rasbt/deep-learning-book/tree/master/code/model_zoo/

I haven’t particularly fine-tuned any of the networks, but they seem to perform decently on MNIST. E.g., ~98% test accuracy via the multi-layer perceptron with batch norm and ~99% via a super simple ResNet.

Thanks @rasbt. I will study them when I need to. BTW, I like your book Python Machine Learning and learned a great deal from it two years ago.


As it might be helpful to others, I posted an example of using nn.CrossEntropyLoss for image segmentation here:
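
For context, this works because nn.CrossEntropyLoss also accepts spatial targets: logits of shape [N, C, H, W] and an integer class map of shape [N, H, W]. A small sketch with made-up shapes (not the code from the linked post):

import torch
import torch.nn as nn

N, C, H, W = 2, 5, 8, 8                      # batch, classes, height, width
logits = torch.randn(N, C, H, W)             # per-pixel class scores
target = torch.randint(0, C, (N, H, W))      # per-pixel integer class labels

criterion = nn.CrossEntropyLoss()
loss = criterion(logits, target)
print(loss)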