Hi, I am trying to use pytorch/test/cpp/c10d/ProcessGroupNCCLTest.cpp on a Linux machine with 1 CPU and 8 GPUs for some tests. I found that in the Allgather test, size_ in ProcessGroup.hpp has a value of 1. As a rule of thumb, isn't size_ the number of CUDA devices, i.e. 8? Why is it 1?
pg->getSize() should return the world size, but I’m currently unsure how you are executing the NCCL tests and which part of the code you are checking.
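To illustrate the distinction between world size and device count, here is a toy sketch. It is not the real c10d code: `ToyProcessGroup` and `ToySingleProcessHarness` are hypothetical names, and whether ProcessGroupNCCLTest is actually set up as a single rank driving all GPUs is an assumption you would need to confirm in the test source. The point is only that the size counts participating ranks (processes), so it can be 1 even when 8 devices are in use.

```cpp
#include <cassert>
#include <vector>

// Toy stand-in for a process group; NOT the real c10d::ProcessGroup.
class ToyProcessGroup {
 public:
  ToyProcessGroup(int rank, int size) : rank_(rank), size_(size) {
    assert(rank >= 0 && rank < size);  // a rank must lie in [0, size)
  }
  int getRank() const { return rank_; }
  int getSize() const { return size_; }  // world size = number of ranks

 private:
  int rank_;
  int size_;
};

// Hypothetical single-process harness: one rank drives every device.
struct ToySingleProcessHarness {
  ToyProcessGroup pg;          // one process, so rank 0 of world size 1
  std::vector<int> deviceIds;  // devices handled by that single rank

  explicit ToySingleProcessHarness(int numDevices) : pg(0, 1) {
    for (int i = 0; i < numDevices; ++i) {
      deviceIds.push_back(i);
    }
  }
};
```

With `ToySingleProcessHarness h(8);`, `h.pg.getSize()` is 1 while `h.deviceIds.size()` is 8. If the real test launches a single process, a size of 1 would be consistent with that, independent of how many CPUs or GPUs the machine has.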
I recompiled PyTorch in the NGC image and commented out every test except Allgather in pytorch/test/cpp/c10d/ProcessGroupNCCLTest.cpp. I also added a print of size_ in ProcessGroupNCCL.cpp. Finally, I ran the compiled executable, ProcessGroupNCCLTest, and size_ printed as 1. My guess is that ProcessGroupNCCLTest generates the data on the CPU and then copies it to the GPU, so size_ ends up being the number of CPUs.