Why does init_process_group hang with world size > 1?

I am running the following code on a workstation with two RTX 4090s:

import os
import torch
import torch.distributed as dist

# Set environment variables
os.environ['MASTER_ADDR'] = 'localhost'
os.environ['MASTER_PORT'] = '12355'

# Initialize the distributed environment
dist.init_process_group(backend="nccl", world_size=2, rank=0)

The line:

dist.init_process_group(backend="nccl", world_size=2, rank=0)

causes the code to hang with no errors. When I set world_size=1 it does not hang. When I dug deeper, I found that the creation of a TCPStore instance inside the init_process_group source code is what freezes. I ran the equivalent code in isolation on my end:

import torch.distributed as dist

rank = 0  # This would vary for each process
is_master = (rank == 0)
world_size = 2
hostname = 'localhost'  # Replace with the hostname of the master node
port = 12355
store = dist.TCPStore(hostname, port, world_size, is_master)

And it hangs in the same way.
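From what I can tell, the master TCPStore only returns once world_size clients have connected to it, which would explain why world_size=1 returns immediately while world_size=2 blocks forever in a single process. Here is a rough sketch of that behavior on my end (I'm assuming the default wait_for_workers=True; the second process is purely illustrative and just stands in for the missing rank-1 worker):

import time
from datetime import timedelta
from multiprocessing import Process

import torch.distributed as dist


def second_client():
    # Stand-in for the missing rank-1 process: a second (non-master)
    # client connecting to the same host/port.
    dist.TCPStore('localhost', 12355, 2, False, timeout=timedelta(seconds=30))
    time.sleep(10)  # keep the connection alive while the master finishes


if __name__ == '__main__':
    p = Process(target=second_client)
    p.start()

    # With world_size=2 the master store only returns once two clients
    # have connected; on its own it blocks until the timeout.
    store = dist.TCPStore('localhost', 12355, 2, True,
                          timeout=timedelta(seconds=30))
    print('master TCPStore created')
    p.join()

With the second client running, the master store is created immediately instead of hanging.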

Here is the output of nvidia-smi on my system:

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 4090        Off | 00000000:41:00.0  On |                  Off |
|  0%   50C    P8               9W / 450W |    156MiB / 24564MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce RTX 4090        Off | 00000000:42:00.0 Off |                  Off |
|  0%   43C    P8               8W / 450W |     13MiB / 24564MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      2029      G   /usr/bin/gnome-shell                        149MiB |
|    1   N/A  N/A      2029      G   /usr/bin/gnome-shell                          6MiB |
+---------------------------------------------------------------------------------------+

I remember reading about some issues with IOMMU and P2P communication on RTX 4090 cards. Can you try the solution here: `torch.distributed.init_process_group` hangs with 4 gpus with `backend="NCCL"` but not `"gloo"` - #6 by ritwik.m07?

I actually tried this solution and it didn't work. I also ran this code on a workstation with two A6000s and I'm seeing the same behavior where it hangs.


If I understand this correctly, rank=0 actually needs to be the rank of each individual process, and init_process_group only returns once all world_size ranks have called it?
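Something like this, maybe, where one script spawns both ranks and each spawned process passes its own rank (a rough sketch of what I mean; the worker function name and the port are just illustrative):

import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp


def worker(rank, world_size):
    os.environ['MASTER_ADDR'] = 'localhost'
    os.environ['MASTER_PORT'] = '12355'

    # Each spawned process passes its own rank; init_process_group
    # blocks until all world_size ranks have called it.
    dist.init_process_group(backend='nccl', world_size=world_size, rank=rank)
    torch.cuda.set_device(rank)
    print(f'rank {rank} initialized')

    dist.destroy_process_group()


if __name__ == '__main__':
    world_size = 2
    mp.spawn(worker, args=(world_size,), nprocs=world_size)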