Torch Distributed Data Parallel and bfloat16 support

Does torch DistributedDataParallel (DDP) work with models that have bfloat16 parameters?

The documentation mentions that it works with fp16, so I was wondering if it extends to
bfloat16.

Here is an example of the error I get when trying to use torch DDP on a model with bfloat16
parameters:

packages/torch/autograd/__init__.py", line 154, in backward
    Variable._execution_engine.run_backward(
RuntimeError: Expected torch.FloatTensor, got CPUBFloat16Type

Any insight would be appreciated.

EDIT:
This is for CPUs and not GPUs.

Based on this issue, it seems that bfloat16 support is currently only implemented in the NCCL backend (i.e., GPU). However, since that issue was opened in April, support might have been added to the CPU backends as well. Could you check the stack trace and post where the error is coming from? I also assume you are only seeing this error when running a DDP setup?
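You can also confirm which backend your process group is actually using with `dist.get_backend()`. A minimal single-process illustration (gloo is used here only as an example; substitute whatever backend you initialize with):

```python
import os
import torch.distributed as dist

# Single-process group purely to illustrate the check; in a real multi-process
# run, each rank would already have called init_process_group.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

backend = dist.get_backend()  # e.g. "gloo", "nccl", or "mpi"
print(backend)

dist.destroy_process_group()
```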

It occurs in the backward phase (my guess is that it's preparing for gradient communication across processes).

Traceback (most recent call last):
  .....
  File "bf16_profile.py", line 282, in train_model
    loss.backward()
  File "........./python3.8/site-packages/torch/_tensor.py", line 307, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "........./python3.8/site-packages/torch/autograd/__init__.py", line 154, in backward
    Variable._execution_engine.run_backward(
RuntimeError: Expected torch.FloatTensor, got CPUBFloat16Type
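For comparison, the same kind of bfloat16 backward pass runs cleanly on CPU without DDP (toy model below, not my actual one), which is why I suspect the gradient communication step:

```python
import torch
import torch.nn as nn

# Plain (non-DDP) bfloat16 backward on CPU: this completes without error.
model = nn.Linear(8, 4).to(torch.bfloat16)
out = model(torch.randn(2, 8, dtype=torch.bfloat16))
out.sum().backward()

grad_dtype = model.weight.grad.dtype  # gradients stay in bfloat16
```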

Yes, this only occurs if I wrap the model in DDP. The process group was initialized with MPI.
From my investigation since this post, I'm pretty sure CPU DDP still doesn't support
bfloat16.
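One workaround that avoids the problem entirely is to keep the parameters in fp32 (so DDP's gradient allreduce stays in a supported dtype) and run the forward pass in bfloat16 via autocast. A minimal sketch, using a single-process gloo group and a toy model for illustration (in a real run each rank would be launched separately, e.g. via mpirun or torchrun):

```python
import os
import torch
import torch.nn as nn
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Single-process gloo group just for illustration.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29501")
dist.init_process_group("gloo", rank=0, world_size=1)

# Parameters stay fp32, so DDP's gradient buckets are fp32...
model = DDP(nn.Linear(8, 4))
x = torch.randn(2, 8)

# ...while the forward math runs in bfloat16 via autocast,
# instead of casting the parameters themselves.
with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    out = model(x)
    loss = out.float().sum()

loss.backward()  # gradients are fp32, so the CPU allreduce path works
grad_dtype = model.module.weight.grad.dtype

dist.destroy_process_group()
```

This trades some memory (fp32 parameter copies) for compatibility, but it keeps the compute in bfloat16 where autocast applies.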