I tried distributed training on two machines, each with two GPUs.
With the "gloo" backend everything runs smoothly, but when I switch to "nccl" it reports the following NCCL error:
"NCCL error in: /pytorch/torch/lib/THD/base/data_channels/DataChannelNccl.cpp:322"
I then ran another experiment on a single machine with two GPUs.
Both "gloo" and "nccl" work, and "nccl" is much faster than "gloo"!
So I am curious about two questions:
(1) Why does "gloo" work across multiple machines while "nccl" does not?
(2) Is "nccl" expected to be faster than "gloo"?
More generally, how should I choose between the backends?
Note that the CUDA and NCCL versions are the same on both machines; however, the GPU driver versions differ.
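For context, my backend choice currently follows the usual rule of thumb (this is just a hypothetical helper I wrote to sketch that logic, not part of torch.distributed; the resulting string gets passed to init_process_group):

```python
def pick_backend(cuda_available: bool) -> str:
    """Sketch of the common rule of thumb for torch.distributed backends.

    "nccl" only supports CUDA tensors but is typically fastest on GPUs;
    "gloo" supports both CPU and GPU tensors and serves as a fallback.
    """
    # Hypothetical selection logic, not an official API.
    return "nccl" if cuda_available else "gloo"


# The chosen string would then be used roughly like:
#   torch.distributed.init_process_group(backend=pick_backend(True), ...)
print(pick_backend(True))
print(pick_backend(False))
```

Is this rule of thumb still the right one when training spans multiple machines, given the error above?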