Minimal example code:
import os
import torch.distributed as dist
os.environ['MASTER_ADDR'] = '148.251.86.243' # My master server IP
print(f"[ {os.getpid()} ] Initializing process group")
dist.init_process_group(backend="nccl")
print(f"[ {os.getpid()} ] world_size = {dist.get_world_size()}, " + f"rank = {dist.get_rank()}, backend={dist.get_backend()}")
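The launch logs further down print the rendezvous variables as a dict. A small sketch of how such a line could be produced, assuming the usual `env://` variable names (`MASTER_ADDR`, `MASTER_PORT`, `RANK`, `WORLD_SIZE`); `rendezvous_env` and `RENDEZVOUS_KEYS` are hypothetical names for illustration, not part of PyTorch:

```python
import os

# Variables read by the env:// rendezvous (assumed set by launch.py or by hand).
RENDEZVOUS_KEYS = ("MASTER_ADDR", "MASTER_PORT", "RANK", "WORLD_SIZE")


def rendezvous_env(environ=None):
    """Return the rendezvous variables that are currently set, as a plain dict."""
    environ = os.environ if environ is None else environ
    return {key: environ[key] for key in RENDEZVOUS_KEYS if key in environ}


if __name__ == "__main__":
    # Print what this process is about to hand to init_process_group.
    print(f"[{os.getpid()}] Initializing process group with: {rendezvous_env()}")
```

Printing this right before `dist.init_process_group(...)` on both servers makes it easy to confirm that each rank sees the same address and port.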
- The master address is hardcoded to my server’s IP:
os.environ['MASTER_ADDR'] = '148.251.86.243'
- Launching with launch.py on the master server hangs:
$ export NCCL_DEBUG=INFO ; export NCCL_DEBUG_SUBSYS=ALL ; python3 /home/dario/.local/lib/python3.6/site-packages/torch/distributed/launch.py --nnode=2 --node_rank=0 --nproc_per_node=1 msg3d/MSG3D/tuturial_DistributedDataParallel.py --local_world_size=1
[26236] Initializing process group with: {'MASTER_ADDR': '148.251.86.243', 'MASTER_PORT': '29500', 'RANK': '0', 'WORLD_SIZE': '2'}
- On the second server it also hangs:
$ export NCCL_DEBUG=INFO ; export NCCL_DEBUG_SUBSYS=ALL ; python3 /home/dario/.local/lib/python3.6/site-packages/torch/distributed/launch.py --nnode=2 --node_rank=1 --nproc_per_node=1 msg3d/MSG3D/tuturial_DistributedDataParallel.py --local_world_size=1
[17186] Initializing process group with: {'MASTER_ADDR': '148.251.86.243', 'MASTER_PORT': '29500', 'RANK': '1', 'WORLD_SIZE': '2'}
- No difference between
dist.init_process_group(backend="nccl")
and backend="gloo"
- No difference when setting
os.environ['NCCL_SOCKET_IFNAME'] = 'lo'
- Setting
export NCCL_DEBUG=INFO ; export NCCL_DEBUG_SUBSYS=ALL ;
produces no new output
- No difference between PyTorch 1.7.1 and 1.8.0-rc4
- dist.init_process_group(backend="nccl", init_method='env://') didn't help
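Both ranks block inside `init_process_group` and switching backends changes nothing, so one check that might narrow this down is whether the second server can open a plain TCP connection to `MASTER_ADDR:MASTER_PORT` at all. A minimal sketch using only the standard library; `can_connect` is a hypothetical helper, and the address and port in the usage comment are taken from the logs above:

```python
import socket


def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts and timeouts.
        return False


# Intended usage (assumption: run on the second server while rank 0 is
# already waiting in init_process_group on the master):
#   can_connect("148.251.86.243", 29500)
```

If this returns `False` while the master is waiting, the hang is probably a network or firewall issue rather than anything specific to NCCL or gloo.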
What else should I try?
Environment:
PyTorch version: 1.7.1 or 1.8.0.dev20210208+cu110
Is debug build: True
CUDA used to build PyTorch: 10.2
ROCM used to build PyTorch: N/A
OS: Ubuntu 18.04.5 LTS (x86_64)
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Clang version: Could not collect
CMake version: version 3.17.1
Python version: 3.6 (64-bit runtime)
Is CUDA available: True
CUDA runtime version: 11.2.142
GPU models and configuration: GPU 0: GeForce GTX 1080
Nvidia driver version: 460.32.03
cuDNN version: /usr/lib/x86_64-linux-gnu/libcudnn.so.7.6.5
HIP runtime version: N/A
MIOpen runtime version: N/A
Versions of relevant libraries:
[pip3] numpy==1.18.4
[pip3] torch==1.7.1 or 1.8.0.dev20210208+cu110
[pip3] torchvision==0.8.1
[conda] Could not collect