I am running the following code on a workstation with two RTX 4090s:
import os
import torch
import torch.distributed as dist
# Set environment variables
os.environ['MASTER_ADDR'] = 'localhost'
os.environ['MASTER_PORT'] = '12355'
# Initialize the distributed environment
dist.init_process_group(backend="nccl", world_size=2, rank=0)
The line:
dist.init_process_group(backend="nccl", world_size=2, rank=0)
hangs without raising any error. With world_size=1 it does not hang. Digging into the PyTorch source, I found that the freeze happens when a TCPStore object is constructed. I reproduced this by constructing the store in isolation:
rank = 0 # This would vary for each process
is_master = (rank == 0)
world_size = 2
hostname = 'localhost' # Replace with the hostname of the master node
port = 12355
store = dist.TCPStore(hostname, port, world_size, is_master)
And it hangs in the same way.
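The TCPStore behaviour can be probed with a short timeout so that the wait becomes visible instead of looking like a silent freeze. This is a minimal sketch, not a fix: `probe_master_store` is a hypothetical helper (not part of PyTorch), and the 5-second timeout is an arbitrary choice. It relies on the documented `timeout` argument of `TCPStore` and on the default `wait_for_workers=True`, under which the master waits for `world_size` participants to join. What happens when the timeout expires appears to differ between PyTorch versions (an exception in some, a plain return in others), so the sketch reports either outcome.

```python
import time
from datetime import timedelta

import torch.distributed as dist


def probe_master_store(world_size, timeout_s=5.0, host="localhost", port=12355):
    """Create the rank-0 (master) TCPStore with a short timeout and report the result.

    Hypothetical helper for illustration only; not part of PyTorch.
    Returns a tuple (outcome, elapsed_seconds).
    """
    start = time.monotonic()
    try:
        # The master listens on `port`. With the default wait_for_workers=True
        # it waits until `world_size` store users (itself included) have joined.
        store = dist.TCPStore(
            host, port, world_size, True,
            timeout=timedelta(seconds=timeout_s),
        )
        outcome = "returned"
        del store  # drop the reference so the listening socket is released
    except Exception as exc:  # exact exception type varies across versions
        outcome = f"raised {type(exc).__name__}"
    return outcome, time.monotonic() - start


if __name__ == "__main__":
    # Only rank 0 is started here, as in the snippet above.
    outcome, elapsed = probe_master_store(world_size=2)
    print(f"world_size=2: {outcome} after {elapsed:.1f}s")
```

With world_size=1 the call should return almost immediately, which matches what I see. With world_size=2 and only this one process running, the call lasts for roughly the timeout, because no second participant ever connects. For the rendezvous to complete, a second process would have to construct the store with rank 1 (is_master=False), for example when both ranks are launched through torchrun or torch.multiprocessing.spawn.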
Here is the output of nvidia-smi on my system:
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 4090        Off | 00000000:41:00.0  On |                  Off |
|  0%   50C    P8               9W / 450W |    156MiB / 24564MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce RTX 4090        Off | 00000000:42:00.0 Off |                  Off |
|  0%   43C    P8               8W / 450W |     13MiB / 24564MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      2029      G   /usr/bin/gnome-shell                        149MiB |
|    1   N/A  N/A      2029      G   /usr/bin/gnome-shell                          6MiB |
+---------------------------------------------------------------------------------------+