Hi, I’ve tried the above combinations for training the network, and it turns out that softmax + crossEntropy works worst in my case (the gradients blow up easily). tanh works better than sigmoid, but still leads to gradients = nan at the end of training.
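
For context, the softmax + crossEntropy setup looks roughly like this (a simplified PyTorch-style sketch; the model, shapes, and data below are placeholders, not my real network):

```python
import torch
import torch.nn as nn

# Toy stand-in for the real network (sizes are placeholders).
model = nn.Sequential(nn.Linear(10, 64), nn.Tanh(), nn.Linear(64, 3))
x = torch.randn(8, 10)
targets = torch.randint(0, 3, (8,))

logits = model(x)

# Explicit softmax followed by log: if any probability underflows
# to 0, torch.log() returns -inf and backward() produces nan grads.
probs = torch.softmax(logits, dim=1)
loss = nn.NLLLoss()(torch.log(probs), targets)
loss.backward()
```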

I also tried logSoftmax + crossEntropy, which is much more stable than all the combinations above, but it still leads to gradients = nan at the very end.
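
The logSoftmax variant looks like this (again a simplified sketch with the same placeholder model and data). My understanding is that the fused log-softmax computes log(softmax(x)) in one step, so it never takes log() of a probability that has already underflowed to zero, which would explain why it is more stable:

```python
import torch
import torch.nn as nn

# Same toy setup as above (placeholder sizes and data).
model = nn.Sequential(nn.Linear(10, 64), nn.Tanh(), nn.Linear(64, 3))
x = torch.randn(8, 10)
targets = torch.randint(0, 3, (8,))

logits = model(x)

# Fused log-softmax: numerically stable log(softmax(x)).
log_probs = torch.log_softmax(logits, dim=1)
loss = nn.NLLLoss()(log_probs, targets)
loss.backward()

# Equivalent one-step form: CrossEntropyLoss applies log-softmax
# internally and therefore expects raw logits.
# loss = nn.CrossEntropyLoss()(logits, targets)
```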

Are there any suggestions for dealing with this issue? Thanks!