Mixed Precision Training on CUDA with bfloat16

Hi, I am trying to run BERT pretraining with AMP and bfloat16. First of all, if I specify

with torch.cuda.amp.autocast(dtype=torch.bfloat16):

the output tensor is shown as float16, not bfloat16. When I change torch.cuda.amp.autocast to torch.autocast("cuda", dtype=torch.bfloat16), the output tensor shows the bfloat16 dtype.
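For reference, here is a minimal sketch of the torch.autocast usage (illustrative only, not taken from the BERT code; device_type="cpu" is used so it runs without a GPU, whereas the training case would pass "cuda"):

```python
import torch
import torch.nn as nn

# Minimal sketch: autocast with bfloat16.
# device_type="cpu" lets this run without a GPU; for GPU training
# you would use torch.autocast("cuda", dtype=torch.bfloat16) instead.
model = nn.Linear(8, 4)
x = torch.randn(2, 8)

with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    out = model(x)

print(out.dtype)  # should report torch.bfloat16 inside the autocast region
```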

However, that does not work either. The following error message shows up:
RuntimeError: expected scalar type BFloat16 but found Half
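That error typically means a float16 tensor (e.g. produced by a module or input hard-coded to .half()) is meeting a bfloat16 tensor in an op that requires matching dtypes. A minimal reproduction of that kind of mismatch, with hypothetical tensors rather than anything from the BERT code:

```python
import torch

# Hypothetical reproduction of the dtype mismatch: ops like matmul
# require both operands to share a dtype, so mixing an explicitly
# half-cast tensor with a bfloat16 one raises a RuntimeError.
a = torch.randn(4, 4).to(torch.bfloat16)
b = torch.randn(4, 4).to(torch.float16)  # e.g. left over from a .half() call

try:
    torch.mm(a, b)
except RuntimeError as e:
    print(f"RuntimeError: {e}")
```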

I was wondering whether PyTorch native AMP still lacks support for bfloat16 mixed precision training.

I am trying to run BERT pretraining using the DeepLearningExamples repository by NVIDIA. Any help or pointers towards the solution would be very much appreciated.

Could you try running the HuggingFace implementation? The NVIDIA one might use custom modules that depend on float16 (unless their README shows bfloat16 usage, which might not be integrated yet).



Thanks so much! I was suspecting the same. Will try the HuggingFace version then.

On a separate note, I’ve been following your answers a lot in this forum lately. Thanks so much for all the help and support. Really appreciate it.
