Question about init_process_group

I tried to run the MNIST model on 2 nodes, each with 4 GPUs.
I can run it with all 8 GPUs, but when I use only some of the GPUs on each node, it gets stuck.
Here is the code snippet:

import torch
from torch.distributed import init_process_group

init_process_group(backend='nccl', init_method='env://', world_size=world_size, rank=rank)
torch.cuda.set_device(local_rank)

rank here is the global rank. It will be
[0,1,2,3] for node1 and [4,5,6,7] for node2 when I use all GPUs, and
[0,1] for node1 and [2,3] for node2 when I use only 2 GPUs on each node.
local_rank is the same on both nodes and ranges over [0, num_used_GPUs_each_node - 1].
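To spell out the bookkeeping, the rank values come from something like this (just a sketch; gpus_per_node and node_rank are illustrative names, not variables from my script):

gpus_per_node = 2                                # GPUs actually used on each node
node_rank = 1                                    # 0 for node1, 1 for node2
local_rank = 0                                   # 0 .. gpus_per_node - 1, same range on both nodes
rank = node_rank * gpus_per_node + local_rank    # global rank: {0,1} on node1, {2,3} on node2
world_size = 2 * gpus_per_node                   # 4 when only 2 GPUs are used per node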

When I use only two GPUs on each node: on node2, after running init_process_group, something is loaded onto GPU[2,3], which is weird.
Since I’ve set the CUDA device, the model and data will be loaded onto GPU[local_rank].
Then the program gets stuck and waits forever.

But when I load the model and data onto GPU[2,3] instead, everything works fine.

So my guess is that NCCL loads something onto GPU[rank % available_GPUs] that is crucial for communication, and if it is not put on the same device as the data and model, the program gets stuck.

I’m not sure whether this is a bug, or whether there is another way to ask NCCL to put things onto GPU[local_rank]?

Thanks!

Just to make sure I understand your situation, let me know if the following description is correct:
You’re using 2 nodes, each with 4 GPUs. You want to use 2 GPUs on each node, which means your intended world size is 4.
The global ranks of processes on node 1 are {0, 1}, and the global ranks of processes on node 2 are {2, 3}.

To achieve this, you can set CUDA_VISIBLE_DEVICES before launching your training script. For example, if you set CUDA_VISIBLE_DEVICES=1,2, the training script will not see the other GPUs. Then you can simply initialize the process group in each process with the correct rank passed in (and there is no need to call torch.cuda.set_device). This ensures the correct GPUs are used for the training processes without any manual configuration.
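For instance, each training process could look roughly like this (just a sketch; it assumes each process is launched with CUDA_VISIBLE_DEVICES restricted to the single GPU it should use, and that RANK and WORLD_SIZE are exported by your launcher, which are assumptions rather than details from your script):

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

rank = int(os.environ["RANK"])              # global rank: {0,1} on node1, {2,3} on node2
world_size = int(os.environ["WORLD_SIZE"])  # 4 in this example

dist.init_process_group(backend='nccl', init_method='env://',
                        world_size=world_size, rank=rank)

# The only visible GPU shows up as cuda:0 inside this process,
# so there is no need to call torch.cuda.set_device.
model = torch.nn.Linear(10, 10).to('cuda:0')   # placeholder model
ddp_model = DDP(model, device_ids=[0])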

Thanks for your reply.
Yes, you are right, I can launch the script with CUDA_VISIBLE_DEVICES=0,1.
What I’m trying to do is set the GPU id manually within the script so that I can distribute the model to more than one GPU with nn.parallel.DistributedDataParallel(model, device_ids=[**more than one**]).
For example, on node two I want to assign GPUs [0, 1] to the processes with global ranks [2, 3], and also allow model parallelism across GPUs [2, 3]. But if I use CUDA_VISIBLE_DEVICES=0,1, both processes can only see two GPUs instead of four, so I can only set device_ids=[0]. When I try to use torch.cuda.set_device(local_rank) instead, I run into the problem I described above.
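Roughly what I have in mind (just a sketch of the intended setup, not a tested recipe; gpus_per_process and the env var names are illustrative, and it relies on the old single-process multi-GPU DDP mode, where one module is replicated across several device_ids, which has since been deprecated):

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

rank = int(os.environ["RANK"])              # global rank, e.g. 2 or 3 on node two
world_size = int(os.environ["WORLD_SIZE"])
local_rank = int(os.environ["LOCAL_RANK"])  # 0 or 1 within the node

gpus_per_process = 2                        # each process drives two of the node's four GPUs
device_ids = [local_rank * gpus_per_process + i for i in range(gpus_per_process)]  # [0,1] or [2,3]

dist.init_process_group(backend='nccl', init_method='env://',
                        world_size=world_size, rank=rank)
torch.cuda.set_device(device_ids[0])        # primary device for this process

model = torch.nn.Linear(10, 10).to(f"cuda:{device_ids[0]}")  # placeholder model on the first device
ddp_model = DDP(model, device_ids=device_ids)                # model spans more than one GPU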

Let me know if there is anything else you need to know.

@Jing-Bi Can you check whether your problem is resolved in the PyTorch nightly release? This might be happening because of a barrier() call in init_process_group, which was removed in https://github.com/pytorch/pytorch/pull/49419.