Issues switching from float16 to bfloat16

I have a complicated multi-loss model that I currently train with float16 without issues. I am trying to switch it to bfloat16, but the model is no longer stable and some of the losses are not converging as expected. Other than the obvious suspects (autocast not actually being set to bfloat16, or accidentally leaving gradient scaling enabled), what could be some reasons this is happening? I am on the latest version of torch with A100 GPUs, so hardware support should not be the issue.
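
For context, the training step looks roughly like this; the model, optimizer, and loss below are simplified placeholders, not my actual multi-loss setup:

```python
import torch

# placeholder model/optimizer standing in for the real multi-loss pipeline
model = torch.nn.Linear(16, 1).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

def train_step(batch, target):
    optimizer.zero_grad(set_to_none=True)
    # autocast explicitly set to bfloat16; no GradScaler, since bfloat16
    # shares float32's exponent range and does not need loss scaling
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        out = model(batch)
        loss = torch.nn.functional.mse_loss(out, target)
    loss.backward()
    optimizer.step()
    return loss.detach()
```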

Edit: The problem was solved by using an eps of 1e-3 rather than 1e-7 in one of the loss functions to guard against values that can be exactly zero.
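
For anyone hitting the same thing, my guess at the mechanism (not a definitive diagnosis): bfloat16 trades mantissa bits for float32's exponent range, so its relative precision is much coarser than float16's, and a very small eps can simply be rounded away. A quick check:

```python
import torch

# bfloat16 keeps float32's exponent range but has only 8 significand bits,
# so its machine epsilon is about 8x larger than float16's
print(torch.finfo(torch.bfloat16).eps)   # 0.0078125
print(torch.finfo(torch.float16).eps)    # 0.0009765625

# an eps of 1e-7 is swallowed by rounding as soon as the other operand is
# even moderately larger, while 1e-3 survives
x = torch.tensor(0.01, dtype=torch.bfloat16)
print(x + 1e-7)  # unchanged: ~0.0100
print(x + 1e-3)  # ~0.0110
```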