init_process_group with launch.py --nnode=2 always hangs on all machines

Minimal example code:

import os
import torch.distributed as dist

os.environ['MASTER_ADDR'] = '148.251.86.243' # My master server IP
print(f"[ {os.getpid()} ] Initializing process group")
dist.init_process_group(backend="nccl")
print(f"[ {os.getpid()} ] world_size = {dist.get_world_size()}, " + f"rank = {dist.get_rank()}, backend={dist.get_backend()}")
  • The master address is hardcoded to my server’s IP: os.environ['MASTER_ADDR'] = '148.251.86.243'

  • Launching with launch.py on the master server hangs:

$ export NCCL_DEBUG=INFO ; export NCCL_DEBUG_SUBSYS=ALL ; python3 /home/dario/.local/lib/python3.6/site-packages/torch/distributed/launch.py --nnode=2 --node_rank=0 --nproc_per_node=1 msg3d/MSG3D/tuturial_DistributedDataParallel.py --local_world_size=1
[26236] Initializing process group with: {'MASTER_ADDR': '148.251.86.243', 'MASTER_PORT': '29500', 'RANK': '0', 'WORLD_SIZE': '2'}
  • On the second server it also hangs:
$ export NCCL_DEBUG=INFO ; export NCCL_DEBUG_SUBSYS=ALL ; python3 /home/dario/.local/lib/python3.6/site-packages/torch/distributed/launch.py --nnode=2 --node_rank=1 --nproc_per_node=1 msg3d/MSG3D/tuturial_DistributedDataParallel.py --local_world_size=1
[17186] Initializing process group with: {'MASTER_ADDR': '148.251.86.243', 'MASTER_PORT': '29500', 'RANK': '1', 'WORLD_SIZE': '2'}
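
For reference, the four keys in the log lines above are the ones the env:// init method reads. Below is a minimal sketch for dumping and sanity-checking them before calling init_process_group. The helper names rendezvous_env and missing_keys are made up for illustration and are not part of torch.

```python
import os

# Environment variables read by the env:// rendezvous.
RENDEZVOUS_KEYS = ("MASTER_ADDR", "MASTER_PORT", "RANK", "WORLD_SIZE")


def rendezvous_env(environ=None):
    """Return the rendezvous variables as a dict (None for unset keys)."""
    environ = os.environ if environ is None else environ
    return {key: environ.get(key) for key in RENDEZVOUS_KEYS}


def missing_keys(environ=None):
    """List the rendezvous variables that are unset or empty."""
    return [key for key, value in rendezvous_env(environ).items() if not value]


if __name__ == "__main__":
    print(f"[ {os.getpid()} ] rendezvous env: {rendezvous_env()}")
    print(f"[ {os.getpid()} ] missing: {missing_keys()}")
```

Printing this on both machines makes it easy to confirm that they agree on MASTER_ADDR, MASTER_PORT and WORLD_SIZE and differ only in RANK.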

What else should I try?

Environment:

PyTorch version: 1.7.1 or 1.8.0.dev20210208+cu110
Is debug build: True
CUDA used to build PyTorch: 10.2
ROCM used to build PyTorch: N/A

OS: Ubuntu 18.04.5 LTS (x86_64)
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Clang version: Could not collect
CMake version: version 3.17.1

Python version: 3.6 (64-bit runtime)
Is CUDA available: True
CUDA runtime version: 11.2.142
GPU models and configuration: GPU 0: GeForce GTX 1080
Nvidia driver version: 460.32.03
cuDNN version: /usr/lib/x86_64-linux-gnu/libcudnn.so.7.6.5
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] numpy==1.18.4
[pip3] torch==1.7.1 or 1.8.0.dev20210208+cu110
[pip3] torchvision==0.8.1
[conda] Could not collect

It looks like you’re using two processes but setting WORLD_SIZE to 1. I think setting WORLD_SIZE to 2 should fix this issue.
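
As a rough illustration of where the expected value comes from: launch.py derives WORLD_SIZE and RANK from its command-line arguments, so with 2 nodes and 1 process per node every process should see WORLD_SIZE=2. The function names below are hypothetical and only restate that arithmetic.

```python
def world_size(nnodes: int, nproc_per_node: int) -> int:
    """Total number of processes across all nodes."""
    return nnodes * nproc_per_node


def global_rank(node_rank: int, nproc_per_node: int, local_rank: int) -> int:
    """Rank of one process within the whole job."""
    return node_rank * nproc_per_node + local_rank


if __name__ == "__main__":
    # Two nodes, one process each, as in the commands in this thread.
    print("WORLD_SIZE =", world_size(nnodes=2, nproc_per_node=1))
    for node_rank in range(2):
        print(f"node_rank={node_rank} -> RANK =",
              global_rank(node_rank, nproc_per_node=1, local_rank=0))
```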

Good spot!

But now it hangs on both machines =/ (edited above).

$ python3 /home/dario/.local/lib/python3.6/site-packages/torch/distributed/launch.py --nnode=2 --node_rank=0 --nproc_per_node=1 msg3d/MSG3D/tuturial_DistributedDataParallel.py --local_world_size=1

[23616] Initializing process group with: {'MASTER_ADDR': '148.251.86.243', 'MASTER_PORT': '29500', 'RANK': '0', 'WORLD_SIZE': '2'}

Continuing to try to debug this:

  • Setting export NCCL_DEBUG=INFO ; export NCCL_DEBUG_SUBSYS=ALL ; produces no new output

Edited in above.

Next attempt:
torch-1.8.0.dev20210208%2Bcu110-cp36-cp36m-linux_x86_64.whl

With
backend="nccl"
NCCL_DEBUG: INFO
NCCL_DEBUG_SUBSYS: ALL
Absolutely no change: dist.init_process_group(backend="nccl") hangs on both machines and no extra output is given.

Deleted down to a minimal example:

import os
import torch.distributed as dist

os.environ['MASTER_ADDR'] = '148.251.86.243'
print(f"[ {os.getpid()} ] Initializing process group")
dist.init_process_group(backend="nccl")
print(f"[ {os.getpid()} ] world_size = {dist.get_world_size()}, " + f"rank = {dist.get_rank()}, backend={dist.get_backend()}")

Edited above.

  • dist.init_process_group(backend="nccl", init_method="env://") didn’t help

Created issue init_process_group with launch.py --nnode=2 hangs always in all machines · Issue #52848 · pytorch/pytorch · GitHub

OK, I managed to open a port (29500) in the firewall, which allows init_process_group to run correctly.

In my firewall settings the rule looks like this:

Source IP | Destination IP | Source port | Destination port | Protocol | TCP flags | Action 
          |                |             |            29500 |      tcp |           | accept

Once the master process is running (it starts listening on 29500), telnet 148.251.86.243 29500 says:

Trying 148.251.86.243...
Connected to gpu14.
Escape character is '^]'.

confirming the port is correctly open.

Running the minimal example above on the second server now makes init_process_group return correctly and print the second message:
[ 8172 ] world_size = 2, rank = 0, backend=nccl
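
The telnet check above can also be scripted, which may help catch a blocked port before init_process_group hangs silently. This is only a sketch using the standard library; port_is_open is a made-up helper name, and the 3-second timeout is an arbitrary choice.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


# Example use on the second server, once the master process is listening:
#   port_is_open("148.251.86.243", 29500)
```

Like telnet, this only succeeds while something is listening on the port, so it has to run after the master process has started.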