Why does double precision training sometimes perform much better?

I trained a variation of LSTM for language modelling with double precision (DoubleTensors everywhere); it trained quickly and the resulting accuracy is really good. Then I trained the same model with single precision (FloatTensors everywhere) and it would not converge: the loss stopped decreasing without getting anywhere near the loss of the double precision model.
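Roughly, the comparison looks like the sketch below (toy sizes and random data stand in for my actual model and corpus); the only difference between the two runs is the dtype:

```python
import torch
import torch.nn as nn

def run(dtype, steps=200, lr=1e-2):
    torch.manual_seed(0)
    lstm = nn.LSTM(input_size=32, hidden_size=64, batch_first=True).to(dtype)
    head = nn.Linear(64, 10).to(dtype)
    opt = torch.optim.Adam(list(lstm.parameters()) + list(head.parameters()), lr=lr)
    x = torch.randn(16, 20, 32, dtype=dtype)   # (batch, seq, features)
    y = torch.randint(0, 10, (16,))            # class targets stay int64
    for _ in range(steps):
        out, _ = lstm(x)
        loss = nn.functional.cross_entropy(head(out[:, -1]), y)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return loss.item()

print("float64 final loss:", run(torch.float64))
print("float32 final loss:", run(torch.float32))
```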

What is a good place to start investigating the issue? I am thinking about plotting the activations / grads with TensorBoard to see if they are too small or too large to fit within single precision limits. Is that a good strategy?
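Concretely, I was planning to log something like this (a sketch assuming `torch.utils.tensorboard`; `model` and `step` would come from my training loop):

```python
import torch
from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter(log_dir="runs/precision_debug")

def log_grads(model, step):
    # Histograms plus per-parameter scalar norms; float32 loses precision
    # below roughly 1e-38, so grads drifting toward that range are suspicious.
    for name, p in model.named_parameters():
        if p.grad is not None:
            writer.add_histogram(f"grad/{name}", p.grad.detach(), step)
            writer.add_scalar(f"grad_norm/{name}", p.grad.norm().item(), step)

def attach_activation_logging(model, step):
    # Forward hooks record each submodule's output for one forward pass;
    # LSTM modules return tuples, so keep only the first element.
    # Remove the returned hooks after the pass to avoid logging every step.
    hooks = []
    def make_hook(name):
        def hook(module, inp, out):
            t = out[0] if isinstance(out, tuple) else out
            writer.add_histogram(f"act/{name}", t.detach(), step)
        return hook
    for name, module in model.named_modules():
        if name:  # skip the root module itself
            hooks.append(module.register_forward_hook(make_hook(name)))
    return hooks
```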

Try decreasing the learning rate; your training may be numerically a bit unstable. Plotting activation / grad norms is indeed a good start.
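For example (a sketch with a toy model; the relevant parts are the lower learning rate and the gradient-norm check, not the model itself):

```python
import torch
import torch.nn as nn

# Toy stand-in model and data; only the lr and the norm check matter here.
model = nn.LSTM(input_size=8, hidden_size=16, batch_first=True)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # e.g. 10x lower than before

x = torch.randn(4, 10, 8)
out, _ = model(x)
loss = out.pow(2).mean()

loss.backward()
# clip_grad_norm_ returns the total norm before clipping, so it doubles
# as a cheap instability monitor even with a generous max_norm.
total_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)
if torch.isfinite(total_norm):
    optimizer.step()
else:
    print("non-finite gradient norm, lower the lr further or clip harder")
optimizer.zero_grad()
```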