How to correctly launch DDP on multiple nodes

The code launches correctly on a single node with multiple processes. However, when I try to launch the same code across multiple nodes, it fails with the following error.

Traceback (most recent call last):
  File "/share/home/bjiangch/group-zyl/.conda/envs/Pytorch-181/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/share/home/bjiangch/group-zyl/.conda/envs/Pytorch-181/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/share/home/bjiangch/group-zyl/zyl/pytorch/multi-GPU/program/eann/__main__.py", line 1, in <module>
    import run.train
  File "/share/home/bjiangch/group-zyl/zyl/pytorch/multi-GPU/program/eann/run/train.py", line 3, in <module>
    from src.read import *
  File "/share/home/bjiangch/group-zyl/zyl/pytorch/multi-GPU/program/eann/src/read.py", line 275, in <module>
    dist.init_process_group(backend=DDP_backend)
  File "/share/home/bjiangch/group-zyl/.conda/envs/Pytorch-181/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 500, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
  File "/share/home/bjiangch/group-zyl/.conda/envs/Pytorch-181/lib/python3.8/site-packages/torch/distributed/rendezvous.py", line 190, in _env_rendezvous_handler
    store = TCPStore(master_addr, master_port, world_size, start_daemon, timeout)
RuntimeError: Permission denied
Traceback (most recent call last):
  File "/share/home/bjiangch/group-zyl/.conda/envs/Pytorch-181/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/share/home/bjiangch/group-zyl/.conda/envs/Pytorch-181/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/share/home/bjiangch/group-zyl/.conda/envs/Pytorch-181/lib/python3.8/site-packages/torch/distributed/launch.py", line 340, in <module>
    main()
  File "/share/home/bjiangch/group-zyl/.conda/envs/Pytorch-181/lib/python3.8/site-packages/torch/distributed/launch.py", line 326, in main
    sigkill_handler(signal.SIGTERM, None)  # not coming back
  File "/share/home/bjiangch/group-zyl/.conda/envs/Pytorch-181/lib/python3.8/site-packages/torch/distributed/launch.py", line 301, in sigkill_handler
    raise subprocess.CalledProcessError(returncode=last_return_code, cmd=cmd)
subprocess.CalledProcessError: Command '['/share/home/bjiangch/group-zyl/.conda/envs/Pytorch-181/bin/python3', '-u', '/share/home/bjiangch/group-zyl/zyl/pytorch/multi-GPU/program/eann/', '--local_rank=1']' returned non-zero exit status 1.

Here are my launch scripts:

python3 -m torch.distributed.launch --nproc_per_node=2 --nnodes=2 --node_rank=0  --master_addr=gpu1 --master_port=22 /share/home/bjiangch/group-zyl/zyl/pytorch/multi-GPU/program/eann/
python3 -m torch.distributed.launch --nproc_per_node=2 --nnodes=2 --node_rank=1  --master_addr=gpu1 --master_port=22 /share/home/bjiangch/group-zyl/zyl/pytorch/multi-GPU/program/eann/

“gpu1” is my hostname, and I have also tried replacing the hostname with the IP address.
Thanks in advance for any kind help.

I think the problem is that you are using port 22 for master_port. Port 22 is reserved for SSH, and ports 0-1023 are generally privileged system ports that require root access (which is probably why you see “Permission denied”). I’d suggest using a port number above 1024 and ensuring no other service is already using it.
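As a quick sanity check (not part of the original thread), you can verify that a candidate port is actually bindable before passing it to --master_port. This helper is just an illustration:

```python
import socket

def port_is_free(port, host=""):
    """Return True if a TCP socket can be bound to the given port.

    A bind failure (OSError) means the port is privileged for this
    user or already in use by another service.
    """
    try:
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
            # Allow rebinding a port left in TIME_WAIT by a previous run.
            s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
            s.bind((host, port))
        return True
    except OSError:
        return False
```

Running `port_is_free(29500)` on the master node before launching would have caught the permission problem with port 22 up front.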

It works! Thanks for your kind help.

This only works when I manually log in to every compute node involved and execute the command on each one:

python3 -m torch.distributed.launch --nproc_per_node=2 --nnodes=2 --node_rank=0  --master_addr=gpu1 --master_port=1027 /share/home/bjiangch/group-zyl/zyl/pytorch/multi-GPU/program/eann/ >out

However, doing this by hand is very inconvenient under a cluster-management system. Do you have any idea how to submit this with a single script through the cluster-management system?

What sort of cluster-management system do you have? Such integrations usually depend on the specific system you use. For example, there is a PyTorch Kubernetes operator: GitHub - kubeflow/pytorch-operator: PyTorch on Kubernetes
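If the cluster happens to run SLURM (an assumption; adapt the directives for your scheduler), a single sbatch script can run the same launch command on every allocated node, deriving --node_rank and --master_addr from SLURM's environment variables instead of hard-coding them. A sketch:

```shell
#!/bin/bash
#SBATCH --job-name=ddp-train
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1      # one launcher process per node
#SBATCH --gres=gpu:2

# The first node in the allocation acts as the rendezvous master.
MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
MASTER_PORT=29500                # any free port above 1024

# srun runs this command once per node; SLURM_NODEID supplies the node rank.
srun bash -c "python3 -m torch.distributed.launch \
    --nproc_per_node=2 \
    --nnodes=$SLURM_NNODES \
    --node_rank=\$SLURM_NODEID \
    --master_addr=$MASTER_ADDR \
    --master_port=$MASTER_PORT \
    /share/home/bjiangch/group-zyl/zyl/pytorch/multi-GPU/program/eann/"
```

Submitting this with `sbatch` replaces the manual per-node logins: the scheduler starts the launcher on every node with the correct rank automatically.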