Softmax/tanh/sigmoid + CrossEntropy leads to NaN in gradients

Hi, I’ve tried the above combinations for training my network. Softmax + CrossEntropy works worst in my case (the gradients blow up easily). Tanh works better than sigmoid, but still leads to gradients = NaN in the end.

I also tried LogSoftmax + CrossEntropy, which is much more stable than all the combinations above, but it still leads to gradients = NaN at the very end.

Is there any suggestion for dealing with this issue? Thanks!

Note that CrossEntropy takes a vector of raw activations as input, not probabilities or log probabilities.
Internally, CrossEntropy is implemented as LogSoftMax + NLLLoss.
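
A quick way to check this is to compare the two formulations on random inputs. This is only a sketch and assumes PyTorch (`torch.nn.CrossEntropyLoss`, `torch.nn.LogSoftmax`, `torch.nn.NLLLoss`); the batch size, class count, and tensor names are made up for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical batch: 4 samples, 3 classes. These are raw activations,
# with no softmax, sigmoid, or tanh applied beforehand.
logits = torch.randn(4, 3)
targets = torch.tensor([0, 2, 1, 2])

# CrossEntropyLoss applied directly to the raw activations.
ce = nn.CrossEntropyLoss()(logits, targets)

# The same loss built by hand: LogSoftmax over the class dimension, then NLLLoss.
log_probs = nn.LogSoftmax(dim=1)(logits)
nll = nn.NLLLoss()(log_probs, targets)

# The two values should agree up to floating-point error.
print(ce.item(), nll.item())
```

If the two numbers match, the last layer of the network can simply output raw activations and leave the normalization to the loss.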

So I would say it is expected that softmax + CrossEntropy does not work well: the softmax output gets normalized a second time inside the loss.
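
One way to see the effect of the extra softmax is a small sketch in plain Python, independent of any library. The helper names (`log_softmax`, `softmax`, `cross_entropy`) and the example values are illustrative only. Once the activations have been squashed into [0, 1], the normalization inside the loss can no longer produce a confident prediction, so the loss stays well above zero even for a perfect one-hot input.

```python
import math

def log_softmax(xs):
    # Subtract the max before exponentiating (log-sum-exp trick) for stability.
    m = max(xs)
    lse = m + math.log(sum(math.exp(x - m) for x in xs))
    return [x - lse for x in xs]

def softmax(xs):
    return [math.exp(v) for v in log_softmax(xs)]

def cross_entropy(activations, target):
    # Mirrors LogSoftMax + NLLLoss for a single sample.
    return -log_softmax(activations)[target]

# Hypothetical activations that are confidently correct for class 0.
logits = [10.0, -5.0, -5.0]
target = 0

loss_on_logits = cross_entropy(logits, target)          # intended usage
loss_on_probs = cross_entropy(softmax(logits), target)  # softmax applied twice

# Best case for the doubled softmax: a perfect one-hot input.
floor = cross_entropy([1.0, 0.0, 0.0], target)

print(loss_on_logits, loss_on_probs, floor)
```

With three classes the loss on a one-hot input is log(1 + 2/e), roughly 0.55, so with the doubled softmax the loss cannot approach zero no matter how well the network fits the data.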
