I tried to run the MNIST model on 2 nodes, each with 4 GPUs.
I can run it with all 8 GPUs, but when I use only some of the GPUs on each node, it gets stuck.
Here is the code snippet:

```python
import torch
from torch.distributed import init_process_group

# rank is the global rank, local_rank the per-node rank (see below)
init_process_group(backend='nccl', init_method='env://',
                   world_size=world_size, rank=rank)
torch.cuda.set_device(local_rank)
```
rank here is the global rank. It is [0,1,2,3] for node1 and [4,5,6,7] for node2 when I use all 8 GPUs, and [0,1] for node1 and [2,3] for node2 when I use only 2 GPUs on each node. local_rank is the same on both nodes and ranges over [0, num_used_GPUs_each_node - 1].
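In other words, the ranks are related roughly like this (node_rank, gpus_per_node, and make_ranks are just illustrative names for this sketch, not what my launcher actually passes in):

```python
# Illustrative only: how rank, local_rank and world_size relate in my setup.
# node_rank is 0 for node1 and 1 for node2; gpus_per_node is the number of
# GPUs actually used per node (4 in the full run, 2 in the failing run).
def make_ranks(node_rank: int, gpus_per_node: int):
    world_size = 2 * gpus_per_node  # two nodes in total
    for local_rank in range(gpus_per_node):
        rank = node_rank * gpus_per_node + local_rank  # global rank
        print(f'node{node_rank + 1}: rank={rank}, local_rank={local_rank}, '
              f'world_size={world_size}')

make_ranks(node_rank=1, gpus_per_node=2)  # node2 with 2 GPUs -> ranks 2 and 3
```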
When I use only two GPUs on each node: on node2, after running init_process_group, something is loaded onto GPU[2,3], which is weird.
Since I've set the CUDA device, the model and data are loaded onto GPU[local_rank].
The program then gets stuck and waits forever.
But when I instead load the model and data onto GPU[2,3], everything works fine.
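Concretely, the workaround looks roughly like this (nn.Linear and the random batch are just stand-ins for my actual MNIST model and data):

```python
import torch
import torch.nn as nn
import torch.distributed as dist

# Workaround that avoids the hang: put everything on
# GPU[rank % number_of_physical_GPUs] instead of GPU[local_rank];
# on node2 (global ranks 2 and 3, 4 physical GPUs) that is GPU 2 or 3.
rank = dist.get_rank()  # init_process_group has already been called
device = torch.device(f'cuda:{rank % torch.cuda.device_count()}')
torch.cuda.set_device(device)

model = nn.Linear(784, 10).to(device)     # stand-in for the MNIST model
batch = torch.randn(32, 784).to(device)   # stand-in for a data batch
```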
So my guess is that NCCL loads something onto GPU[rank % available_GPUs] that is crucial for communication, and if it is not on the same GPU as the model and data, the program gets stuck.
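To check which GPU that is, a quick diagnostic like this (just a sketch) can be run right after init_process_group; note that memory_allocated only tracks PyTorch's own caching allocator, so the CUDA context NCCL creates is more visible in nvidia-smi:

```python
import torch
import torch.distributed as dist

# Run right after the init_process_group call from the snippet above,
# before the model or any data is moved to a GPU.
rank = dist.get_rank()
for i in range(torch.cuda.device_count()):
    print(f'rank {rank}: cuda:{i} allocated {torch.cuda.memory_allocated(i)} bytes')
```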
I'm not sure whether this is a bug, or whether there is another way to ask NCCL to put things on GPU[local_rank]?