Adam+Half Precision = NaNs?

You might want to look at this paper:
https://arxiv.org/abs/1609.07061
If you're willing to keep a copy of the weights and gradients in FP32, you might be able to reduce the precision of the forward/backward step much further than FP16.
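To make the "FP32 copy" idea concrete, here is a rough sketch of the pattern on a toy linear regression. It is not the method from the linked paper, just an illustration under my own assumptions: plain NumPy instead of a deep learning framework, vanilla gradient descent instead of Adam, FP16 as the low-precision format, and a made-up function name (`train_linear_mixed_precision`).

```python
import numpy as np

def train_linear_mixed_precision(x, y, steps=200, lr=0.1):
    """Fit y ~ x @ w with FP16 forward/backward and an FP32 master copy of w.

    Hypothetical helper for illustration only.
    """
    master_w = np.zeros(x.shape[1], dtype=np.float32)  # FP32 master weights
    x16 = x.astype(np.float16)
    y16 = y.astype(np.float16)
    n = np.float32(x.shape[0])
    for _ in range(steps):
        w16 = master_w.astype(np.float16)       # low-precision working copy
        err16 = x16 @ w16 - y16                 # forward pass in FP16
        grad16 = x16.T @ err16                  # backward pass in FP16
        grad32 = grad16.astype(np.float32) / n  # gradient kept in FP32
        master_w -= np.float32(lr) * grad32     # update applied in FP32
    return master_w

# Toy usage: recover known weights from noiseless data.
rng = np.random.default_rng(0)
x = rng.standard_normal((64, 3)).astype(np.float32)
true_w = np.array([1.5, -2.0, 0.5], dtype=np.float32)
y = x @ true_w
w = train_linear_mixed_precision(x, y)
```

The point of the FP32 master copy is that small updates (`lr * grad`) are not rounded away when added to the weights, which can happen if the weights themselves live in FP16. Only the working copy used in the forward/backward pass is low precision.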
