You might want to look at this paper:
https://arxiv.org/abs/1609.07061
if you’re willing to keep a copy of weights/gradients in FP32 you might be able to reduce the precision of forward/backward step much further than FP16.
1 Like