Do the PyTorch 1.8 binaries support distributed data parallel on AMD?
What should I use as the communication backend, nccl or gloo?

If you’re using the ROCm binaries, the “nccl” backend will work, since it transparently uses RCCL under the hood.

Thank you, but I still get an NCCL error while initializing the model with model = DDP(model, …)
RuntimeError: NCCL error in: /pytorch/torch/lib/c10d/ProcessGroupNCCL.cpp:825, internal error, NCCL version 2.7.8
ncclInternalError: Internal check failed. This is either a bug in NCCL or due to memory corruption

The error is unfortunately too generic to indicate the root cause. A few sanity checks:

  1. What command did you use to install the PyTorch 1.8 binaries?
  2. Are you ensuring that each rank in your distributed job is assigned a different device ID?
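For the second check, the rank-to-device mapping is the usual suspect: if two ranks end up on the same GPU, NCCL/RCCL can fail with exactly this kind of opaque internal error. A sketch of the common one-process-per-GPU pattern (`local_device` and `wrap_model` are hypothetical helper names, not PyTorch APIs):

```python
import torch
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def local_device(rank: int, num_gpus: int) -> int:
    # Map a global rank to a GPU index on its node; with one process
    # per GPU, no two ranks on the same node share a device.
    return rank % num_gpus

def wrap_model(model: nn.Module, rank: int) -> DDP:
    # Pin this process to its own GPU *before* wrapping in DDP,
    # then pass that single device in device_ids.
    dev = local_device(rank, torch.cuda.device_count())
    torch.cuda.set_device(dev)
    return DDP(model.to(dev), device_ids=[dev])
```

If you are instead passing the same `device_ids` to every rank (or calling `model.cuda()` without `set_device`), that would explain the ncclInternalError.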