Distributed training works with Gloo but fails with NCCL. What is the difference between them?


I tried distributed training on two machines, each with two GPUs.
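
For reference, a minimal sketch of this kind of setup, assuming one process per GPU and the "env://" init method. The helper names (`global_rank`, `init_worker`) and the node-by-node rank numbering are illustrative assumptions, not taken from an actual training script.

```python
def global_rank(node_rank, local_rank, gpus_per_node):
    """Global rank of a process, with ranks numbered node by node (an assumed layout)."""
    return node_rank * gpus_per_node + local_rank


def init_worker(backend, node_rank, local_rank, num_nodes=2, gpus_per_node=2):
    """Join the process group from one worker process (one process per GPU).

    Assumes MASTER_ADDR and MASTER_PORT are already set in the environment,
    as the "env://" init method requires.
    """
    # Imported lazily so the rank arithmetic above can be checked without PyTorch.
    import torch
    import torch.distributed as dist

    torch.cuda.set_device(local_rank)  # bind this process to its own GPU
    dist.init_process_group(
        backend=backend,  # "gloo" or "nccl"
        init_method="env://",
        world_size=num_nodes * gpus_per_node,
        rank=global_rank(node_rank, local_rank, gpus_per_node),
    )
```

With two machines and two GPUs each, this gives a world size of 4, and the second GPU on the second machine gets rank 3.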

If I use the “gloo” backend, everything goes smoothly. Yet when I changed to “nccl”, it reported the following NCCL error:

“NCCL error in: /pytorch/torch/lib/THD/base/data_channels/DataChannelNccl.cpp:322”
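
The message above says little about the cause. One way to get more detail might be to turn on NCCL's own logging before the process group is created. The sketch below uses the `NCCL_DEBUG` and `NCCL_SOCKET_IFNAME` environment variables documented by NCCL; the helper name `nccl_debug_env` and the interface name "eth0" in the comment are hypothetical.

```python
import os


def nccl_debug_env(interface=None):
    """Environment variables that make NCCL log more detail.

    `interface` is optional: if given, NCCL is told which network
    interface to use for inter-machine communication.
    """
    env = {"NCCL_DEBUG": "INFO"}
    if interface is not None:
        env["NCCL_SOCKET_IFNAME"] = interface  # e.g. "eth0" (hypothetical name)
    return env


# Must run in every worker process before the process group is initialized.
os.environ.update(nccl_debug_env())
```

These variables have to be set on both machines, since each process reads its own environment.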

I then ran another experiment on a single machine with two GPUs.
Both “gloo” and “nccl” work there, and “nccl” is much faster than “gloo”.

So I have two questions:
(1) Why does “gloo” work across multiple machines while “nccl” does not?
(2) Is “nccl” expected to be faster than “gloo”?

Furthermore, how should I choose between the backends?
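
For context, the rule of thumb I have seen in the PyTorch distributed documentation is NCCL for GPU training and Gloo for CPU training. A minimal sketch of that rule, with a fallback to Gloo when NCCL is not available (the function name `pick_backend` is made up for illustration, and the fallback is an assumption, not an official recommendation):

```python
def pick_backend(use_gpu, nccl_available):
    """Return "nccl" for GPU training when NCCL is available, else "gloo"."""
    if use_gpu and nccl_available:
        return "nccl"
    return "gloo"


# With PyTorch installed, the inputs could come from, for example:
#   torch.cuda.is_available() and torch.distributed.is_nccl_available()
# (the second call exists in recent PyTorch versions).
```

This only encodes the documented default choice; it does not explain why NCCL fails in the two-machine case.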

Note that the CUDA and NCCL versions are the same on both machines. However, the GPU driver versions differ.
