Try inspecting the magnitude of your gradients; your loss may be suffering from numerical problems. If it is, try changing the formula of your loss without changing its objective. For instance, it is well known that it's generally better to use nn.LogSoftmax instead of taking the log of nn.Softmax, since the separate softmax can underflow to zero before the log is applied. This thread discusses this observed property.
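Here is a minimal sketch of both suggestions, in case it helps. The toy model, the batch, and the helper name grad_norms are made up for illustration and are not from your code; the extreme logits are chosen only to make the numerical issue visible.

```python
import torch
from torch import nn

def grad_norms(model):
    """Hypothetical helper: map each parameter name to the L2 norm of its gradient."""
    return {name: p.grad.norm().item()
            for name, p in model.named_parameters()
            if p.grad is not None}

# Toy model and batch, purely illustrative.
torch.manual_seed(0)
model = nn.Linear(4, 3)
x = torch.randn(8, 4)
target = torch.randint(0, 3, (8,))

# Same objective as softmax + log + NLL, written in the stable form.
log_probs = nn.LogSoftmax(dim=1)(model(x))
loss = nn.NLLLoss()(log_probs, target)
loss.backward()

# Very large, very small, or NaN norms here hint at numerical trouble.
norms = grad_norms(model)
for name, value in norms.items():
    print(f"{name}: grad norm = {value:.4e}")

# Why the formulation matters: with extreme logits, the small
# probabilities underflow to 0 in softmax, so the log gives -inf.
logits = torch.tensor([[1000.0, 0.0, -1000.0]])
naive = torch.log(nn.Softmax(dim=1)(logits))   # contains -inf
stable = nn.LogSoftmax(dim=1)(logits)          # stays finite
print("log(softmax):", naive)
print("log_softmax: ", stable)
```

If the per-parameter norms look reasonable in your real model, the problem is probably elsewhere; if they blow up or vanish, rewriting the loss in an equivalent but more stable form, as in the second half, is a cheap thing to try first.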