I’m using the Adam optimizer for backpropagation. The last layer of my network is a Softmax, and I’m using CrossEntropyLoss. I know that Softmax combined with CrossEntropyLoss is not the best choice, since CrossEntropyLoss already applies a softmax internally, but this way the model output gives values between 0 and 1.
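For clarity, here is a minimal sketch of the two setups I’m comparing (the layer sizes and learning rate are placeholders, not my actual network):

```python
import torch
import torch.nn as nn

# Variant A: Softmax as the last layer (what I currently have).
# Placeholder sizes, just to illustrate the structure.
model_with_softmax = nn.Sequential(
    nn.Linear(128, 64),
    nn.ReLU(),
    nn.Linear(64, 2),
    nn.Softmax(dim=1),   # output values are between 0 and 1
)

# Variant B: no Softmax, the network outputs raw logits.
model_logits = nn.Sequential(
    nn.Linear(128, 64),
    nn.ReLU(),
    nn.Linear(64, 2),
)

criterion = nn.CrossEntropyLoss()   # expects raw logits, applies log-softmax internally
optimizer = torch.optim.Adam(model_with_softmax.parameters(), lr=1e-3)
```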
When feeding the network one sample, I get an output like this:
Model output: tensor([[1.0000e+00, 1.4013e-45]], device='cuda:0', grad_fn=<SoftmaxBackward0>)
But when I remove the Softmax from the last layer, I get:
Model output: tensor([[-5.9609, 4.6506]], device='cuda:0', grad_fn=<AddmmBackward0>)
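The grad_fn difference above can be reproduced in isolation, since grad_fn just records the last autograd operation that produced the tensor (again, the sizes here are placeholders):

```python
import torch
import torch.nn as nn

linear = nn.Linear(4, 2)            # stand-in for the last layer
x = torch.randn(1, 4)

logits = linear(x)
print(logits.grad_fn)               # <AddmmBackward0 ...> from the Linear layer's addmm

probs = torch.softmax(logits, dim=1)
print(probs.grad_fn)                # <SoftmaxBackward0 ...> from the softmax applied last
```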
What I have observed is that without the Softmax, training converges faster and the loss decreases to almost 0.
I would like to know whether there is a real difference between these two cases. And why is grad_fn different, given that I’m using the Adam optimizer in both cases?