Adam optimizer with softmax gives grad_fn=<SoftmaxBackward0>

I’m using the Adam optimizer for backpropagation. In the last layer of my network I have the SoftMax function, and I’m using CrossEntropyLoss. I know that SoftMax with CrossEntropyLoss is not the best choice, since CrossEntropyLoss already applies SoftMax internally, but I used it so that the model output gives values between 0 and 1.
When feeding the network one sample, I get an output like this:

Model output: tensor([[1.0000e+00, 1.4013e-45]], device='cuda:0', grad_fn=<SoftmaxBackward0>)

But when I remove SoftMax from the last layer, I get:

Model output: tensor([[-5.9609,  4.6506]], device='cuda:0', grad_fn=<AddmmBackward0>)

What I have noticed is that without SoftMax, training converges faster and the loss decreases to almost 0.

I would like to know whether there is a difference between these two cases, and why grad_fn is different when I’m using the Adam optimizer in both of them.
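For reference, here is a minimal sketch of the two cases I’m comparing (the Linear head and the feature size are just placeholders, not my actual network):

```python
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"

# Hypothetical stand-in for my network: a Linear head with 2 output classes
# (16 input features is just a placeholder, not my real architecture).
head = nn.Linear(16, 2).to(device)
x = torch.randn(1, 16, device=device)

# Case 1: SoftMax as the last layer -> the last recorded op is Softmax,
# so the output prints with grad_fn=<SoftmaxBackward0>
with_softmax = nn.Sequential(head, nn.Softmax(dim=1))
print(with_softmax(x))

# Case 2: no SoftMax -> the last op is the Linear layer (addmm),
# so the output prints with grad_fn=<AddmmBackward0>
print(head(x))
```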

Hi,

When using the SoftMax function, you’re predicting class probabilities given an input (for your sample, the model puts essentially all of the probability on the first of the 2 available classes).

In the second case, you’re just outputting the raw logits, so when you fit the model on those you’re no longer predicting classes via a probability distribution but fitting the unnormalized outputs (which isn’t the same thing).

If you want to read more, there’s a nice thread with further information here: Logits vs. log-softmax - #2 by KFrank
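To make the difference concrete, here’s a small sketch using the logits from your second output (the grad_fn difference comes from the forward ops, not from Adam):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

criterion = nn.CrossEntropyLoss()

logits = torch.tensor([[-5.9609, 4.6506]])  # raw outputs, as in the second case
target = torch.tensor([1])

# CrossEntropyLoss applies log-softmax internally, so it expects raw logits:
loss_from_logits = criterion(logits, target)
# equivalent to:
manual = F.nll_loss(F.log_softmax(logits, dim=1), target)

# Feeding already-softmaxed probabilities applies softmax a second time,
# which compresses the values and weakens the gradients:
loss_from_probs = criterion(F.softmax(logits, dim=1), target)

print(loss_from_logits, manual, loss_from_probs)
```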

If I understood it correctly, using SoftMax with CrossEntropyLoss is bad practice, so I won’t use SoftMax. Another thing I’m curious about: which loss function should I use since I have 2 classes, CrossEntropyLoss or BCEWithLogitsLoss?
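For context, this is roughly how I understand each option would look for my 2-class case (the batch size and shapes are placeholders, not my actual code):

```python
import torch
import torch.nn as nn

batch = 4

# Option A: CrossEntropyLoss -> 2 logits per sample, integer class targets
logits_2col = torch.randn(batch, 2)
targets_idx = torch.randint(0, 2, (batch,))
loss_ce = nn.CrossEntropyLoss()(logits_2col, targets_idx)

# Option B: BCEWithLogitsLoss -> 1 logit per sample, float 0/1 targets
logits_1col = torch.randn(batch, 1)
targets_01 = torch.randint(0, 2, (batch, 1)).float()
loss_bce = nn.BCEWithLogitsLoss()(logits_1col, targets_01)

print(loss_ce, loss_bce)
```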