Based on this issue it seems that bfloat16 support is currently only implemented in the NCCL backend (so GPU-only). However, since the issue was opened in April, support might have been added for the CPU backends as well. Could you check the stacktrace and post where the error is coming from? I also assume you are only seeing this error when running a DDP setup?
It occurs during the backward pass (my guess is that it happens while preparing for gradient communication across processes).
```
Traceback (most recent call last):
.....
  File "bf16_profile.py", line 282, in train_model
    loss.backward()
  File "........./python3.8/site-packages/torch/_tensor.py", line 307, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "........./python3.8/site-packages/torch/autograd/__init__.py", line 154, in backward
    Variable._execution_engine.run_backward(
RuntimeError: Expected torch.FloatTensor, got CPUBFloat16Type
```
Yes, this only occurs if I wrap the model in DDP. The process group was initialized with MPI.
From my investigation since this post, I'm fairly sure CPU DDP still doesn't support bfloat16.
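As a possible workaround until CPU DDP gains bfloat16 support, you could keep the model parameters in float32 (so the gradient tensors DDP reduces are FloatTensors) and run only the forward computation in bfloat16 via CPU autocast. The sketch below shows the dtype behavior on a single process; the model, shapes, and the idea of wrapping it in DDP afterwards (e.g. with a Gloo or MPI process group) are illustrative assumptions, not from the original post.

```python
import torch
import torch.nn as nn

# Illustrative model: parameters stay in float32.
model = nn.Linear(16, 4)

x = torch.randn(8, 16)
# Run the forward pass in bfloat16 using CPU autocast.
with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    out = model(x)  # computed in bfloat16 under autocast

# Compute the loss outside autocast (and in float32) so the gradients
# accumulated on the float32 parameters are float32 as well.
loss = out.float().sum()
loss.backward()

print(out.dtype)                # bfloat16 activations
print(model.weight.grad.dtype)  # float32 gradients, safe for CPU DDP comms
```

In a multi-process run you would presumably wrap `model` in `DistributedDataParallel` after `dist.init_process_group(...)`; since the parameters and gradients remain float32, the backward-hook all-reduce should no longer see `CPUBFloat16Type` tensors.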