DDP with AMD ROCm

Do the PyTorch 1.8 binaries support distributed data parallel (DDP) on AMD GPUs?
What should I use as the communication backend, nccl or gloo?

If you’re using the ROCm binaries, the “nccl” backend will work since it transparently uses RCCL under the hood.
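
As a minimal sketch (assuming the usual MASTER_ADDR/MASTER_PORT/RANK/WORLD_SIZE environment variables are set by your launcher), the initialization code is identical to the CUDA/NCCL case:

```python
import torch.distributed as dist

# On ROCm builds of PyTorch, the "nccl" backend is backed by RCCL,
# so no ROCm-specific code is needed here.
dist.init_process_group(
    backend="nccl",        # transparently maps to RCCL on ROCm
    init_method="env://",  # reads MASTER_ADDR/MASTER_PORT/RANK/WORLD_SIZE
)
```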

Thank you, but I still get an NCCL error while initializing the model with model = DDP(model, …):
RuntimeError: NCCL error in: /pytorch/torch/lib/c10d/ProcessGroupNCCL.cpp:825, internal error, NCCL version 2.7.8
ncclInternalError: Internal check failed. This is either a bug in NCCL or due to memory corruption

The error is unfortunately too generic to indicate the root cause. A few sanity checks:

  1. What command did you use to install the PyTorch 1.8 binaries?
  2. Are you ensuring that each rank in your distributed model is being assigned to a different device ID? (See the sketch below.)
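
For point 2, a minimal sketch, assuming one process per GPU and that a launcher such as torchrun sets the LOCAL_RANK environment variable:

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def setup_ddp(model: torch.nn.Module) -> DDP:
    # One process per GPU: pin this process to its own device before
    # creating the process group or moving the model.
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    dist.init_process_group(backend="nccl")  # RCCL on ROCm builds

    model = model.to(local_rank)
    # device_ids must contain exactly this process's device; if several
    # ranks end up on the same device, NCCL/RCCL can fail with a generic
    # internal error like the one above.
    return DDP(model, device_ids=[local_rank])
```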