How to submit a DDP job on PBS/SLURM across multiple nodes

Hi everyone, I am trying to train with DistributedDataParallel. Thanks to the great work of the PyTorch team, it achieves very high efficiency for me. Everything is fine when the model is trained on a single node. However, when I request multiple nodes in one job script, all the processes end up on the host node and the slave node never runs anything. Here is my script for the PBS workload manager (a trimmed sketch of my training entry point follows it):

#!/bin/sh
#PBS -V
#PBS -q gpu
#PBS -N test_1e4_T=1
#PBS -l nodes=2:ppn=2
source /share/home/bjiangch/group-zyl/.bash_profile
conda activate Pytorch-181
cd $PBS_O_WORKDIR

path="/share/home/bjiangch/group-zyl/zyl/pytorch/multi-GPU/program/eann/"

#Number of processes per node to launch
NPROC_PER_NODE=2

#Number of processes across all nodes
WORLD_SIZE=`expr $PBS_NUM_NODES \* $NPROC_PER_NODE`

MASTER=`/bin/hostname -s`
cat $PBS_NODEFILE>nodelist
#Make sure this node (MASTER) comes first
SLAVES=`cat nodelist | grep -v $MASTER | uniq`

#We want names of master and slave nodes
HOSTLIST="$MASTER $SLAVES"


#The path where your code is placed
#This is the command that runs your pytorch script
#You will want to replace this
COMMAND="$path --world_size=$WORLD_SIZE"


#Get a random unused port on this host (MASTER)
#First line gets list of unused ports
#3rd line gets single random port from the list
MPORT=`ss -tan | awk '{print $5}' | cut -d':' -f2 | \
        grep "[2-9][0-9]\{3,3\}" | sort | uniq | shuf -n 1`


#Launch the pytorch processes, first on master (first in $HOSTLIST) then on the slaves
RANK=0
for node in $HOSTLIST; do
        ssh -q $node
                python3 -m torch.distributed.launch \
                --nproc_per_node=$NPROC_PER_NODE \
                --nnodes=$PBS_NUM_NODES \
                --node_rank=$RANK \
                --master_addr="$MASTER" --master_port="$MPORT" \
                $COMMAND &
        RANK=$((RANK+1))
done
wait
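
For reference, the part of my training code that the tracebacks below point at boils down to roughly the following. This is a trimmed sketch rather than the real eann code: the Linear model is just a stand-in, and the env:// initialization is my summary of the usual pattern that torch.distributed.launch expects.

# Simplified sketch of the training entry point started by torch.distributed.launch
import argparse

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=0)   # injected by torch.distributed.launch
parser.add_argument("--world_size", type=int, default=1)   # passed via $COMMAND in the job script
args = parser.parse_args()

# torch.distributed.launch also exports MASTER_ADDR, MASTER_PORT, RANK and WORLD_SIZE,
# so the env:// init method can read them
dist.init_process_group(backend="nccl", init_method="env://")

local_rank = args.local_rank
torch.cuda.set_device(local_rank)
device = torch.device("cuda", local_rank)

Prop_class = torch.nn.Linear(8, 8).to(device)   # stand-in for the real network
Prop_class = DDP(Prop_class, device_ids=[local_rank], output_device=local_rank)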

It is adapted from the example here.
I want to submit a 4-process job (2 nodes with 2 processes per node).
For validation, I manually ssh to each node from the login node and run the launch command by hand (abbreviated here; the remaining arguments are the same as in the script):
ssh gpu1
python3 -m torch.distributed.launch --nnodes=2 --node_rank=0 ...
ssh gpu2
python3 -m torch.distributed.launch --nnodes=2 --node_rank=1 ...

It works and gives pretty good parallel efficiency. The same problem also occurs on another cluster with a Slurm workload manager. I don't see any difference between the two approaches that could lead to such different results. Any suggestions are welcome.
And the final error (each of the four worker processes prints the same traceback) is:

Traceback (most recent call last):
  File "/share/home/bjiangch/group-zyl/.conda/envs/Pytorch-181/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/share/home/bjiangch/group-zyl/.conda/envs/Pytorch-181/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/share/home/bjiangch/group-zyl/zyl/pytorch/multi-GPU/program/eann/__main__.py", line 1, in <module>
    import run.train
  File "/share/home/bjiangch/group-zyl/zyl/pytorch/multi-GPU/program/eann/run/train.py", line 70, in <module>
    Prop_class = DDP(Prop_class, device_ids=[local_rank], output_device=local_rank)
  File "/share/home/bjiangch/group-zyl/.conda/envs/Pytorch-181/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 446, in __init__
    self._sync_params_and_buffers(authoritative_rank=0)
  File "/share/home/bjiangch/group-zyl/.conda/envs/Pytorch-181/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 457, in _sync_params_and_buffers
    self._distributed_broadcast_coalesced(
  File "/share/home/bjiangch/group-zyl/.conda/envs/Pytorch-181/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1155, in _distributed_broadcast_coalesced
    dist._broadcast_coalesced(
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1616554793803/work/torch/lib/c10d/ProcessGroupNCCL.cpp:825, invalid usage, NCCL version 2.7.8
ncclInvalidUsage: This usually reflects invalid usage of NCCL library (such as too many async ops, too many collectives at once, mixing streams in a group, etc).
Traceback (most recent call last):
  File "/share/home/bjiangch/group-zyl/.conda/envs/Pytorch-181/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/share/home/bjiangch/group-zyl/.conda/envs/Pytorch-181/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/share/home/bjiangch/group-zyl/.conda/envs/Pytorch-181/lib/python3.8/site-packages/torch/distributed/launch.py", line 340, in <module>
    main()
  File "/share/home/bjiangch/group-zyl/.conda/envs/Pytorch-181/lib/python3.8/site-packages/torch/distributed/launch.py", line 326, in main
    sigkill_handler(signal.SIGTERM, None)  # not coming back
  File "/share/home/bjiangch/group-zyl/.conda/envs/Pytorch-181/lib/python3.8/site-packages/torch/distributed/launch.py", line 301, in sigkill_handler
    raise subprocess.CalledProcessError(returncode=last_return_code, cmd=cmd)
subprocess.CalledProcessError: Command '['/share/home/bjiangch/group-zyl/.conda/envs/Pytorch-181/bin/python3', '-u', '/share/home/bjiangch/group-zyl/zyl/pytorch/multi-GPU/program/eann/', '--local_rank=1', '--world_size=4']' returned non-zero exit status 1.

My understanding is that when you run the ssh commands manually they work, but the script, which does essentially the same thing, fails?

If so, could you print out the ssh commands that the script actually builds, run them manually to check whether they work, and share them here? It could be a bug in the script.

Thank you for your response. Actually, I have manually launched the "python3 -m torch.distributed …" command, and it works under the PBS management system; however, it still fails under Slurm. I have also tried a different version of PyTorch (1.9.0), and the error suggests using torch.distributed.run instead of torch.distributed.launch. The launch script then seems simpler than before, and it may no longer need to ssh to each node (I cannot tell for sure; see also my note on --local_rank after the script below). Still, it works on PBS and fails on Slurm with the following error:

[INFO] 2021-07-10 16:51:24,635 run: Running torch.distributed.run with args: ['/home/chp/bjiangch/.conda/envs/PyTorch-190/lib/python3.9/site-packages/torch/distributed/run.py', '--nproc_per_node=2', '--nnodes=1', '--rdzv_id=1050201', '--rdzv_backend=c10d', '--rdzv_endpoint=gnode09:6818', '/home/chp/bjiangch/zyl/2021_0705/program/eann/']
[INFO] 2021-07-10 16:51:24,641 run: Using nproc_per_node=2.
[INFO] 2021-07-10 16:51:24,641 api: Starting elastic_operator with launch configs:
  entrypoint       : /home/chp/bjiangch/zyl/2021_0705/program/eann/
  min_nodes        : 1
  max_nodes        : 1
  nproc_per_node   : 2
  run_id           : 1050201
  rdzv_backend     : c10d
  rdzv_endpoint    : gnode09:6818
  rdzv_configs     : {'timeout': 900}
  max_restarts     : 3
  monitor_interval : 5
  log_dir          : None
  metrics_cfg      : {}

terminate called after throwing an instance of 'std::system_error'
  what():  Connection reset by peer
Fatal Python error: Aborted

Thread 0x00002b644a6b7a00 (most recent call first):
  File "/home/chp/bjiangch/.conda/envs/PyTorch-190/lib/python3.9/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 103 in _call_store
  File "/home/chp/bjiangch/.conda/envs/PyTorch-190/lib/python3.9/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 54 in __init__
  File "/home/chp/bjiangch/.conda/envs/PyTorch-190/lib/python3.9/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 206 in create_backend
  File "/home/chp/bjiangch/.conda/envs/PyTorch-190/lib/python3.9/site-packages/torch/distributed/elastic/rendezvous/registry.py", line 35 in _create_c10d_handler
  File "/home/chp/bjiangch/.conda/envs/PyTorch-190/lib/python3.9/site-packages/torch/distributed/elastic/rendezvous/api.py", line 253 in create_handler
  File "/home/chp/bjiangch/.conda/envs/PyTorch-190/lib/python3.9/site-packages/torch/distributed/elastic/rendezvous/registry.py", line 64 in get_rendezvous_handler
  File "/home/chp/bjiangch/.conda/envs/PyTorch-190/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 214 in launch_agent
  File "/home/chp/bjiangch/.conda/envs/PyTorch-190/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348 in wrapper
  File "/home/chp/bjiangch/.conda/envs/PyTorch-190/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 116 in __call__
  File "/home/chp/bjiangch/.conda/envs/PyTorch-190/lib/python3.9/site-packages/torch/distributed/run.py", line 621 in run
  File "/home/chp/bjiangch/.conda/envs/PyTorch-190/lib/python3.9/site-packages/torch/distributed/run.py", line 629 in main
  File "/home/chp/bjiangch/.conda/envs/PyTorch-190/lib/python3.9/site-packages/torch/distributed/run.py", line 637 in <module>
  File "/home/chp/bjiangch/.conda/envs/PyTorch-190/lib/python3.9/runpy.py", line 87 in _run_code
  File "/home/chp/bjiangch/.conda/envs/PyTorch-190/lib/python3.9/runpy.py", line 197 in _run_module_as_main
/var/spool/slurmd/job1050201/slurm_script: line 46: 17544 Aborted                 python -m torch.distributed.run --nproc_per_node=$NPROC_PER_NODE --nnodes=$SLURM_JOB_NUM_NODES --rdzv_id=$SLURM_JOB_ID --rdzv_backend=c10d --rdzv_endpoint=$MASTER:$MPORT $COMMAND > out

Next is my Slurm job script (the error above refers to its line 46):

  1 #!/bin/sh
  2 #SBATCH -J 1e5-N-T=1
  3 #SBATCH -p GPU-V100
  4 #SBATCH --qos=gpujoblimit
  5 ##SBATCH --qos=qos_a100_gpu
  6 #SBATCH --gres=gpu:2
  7 #SBATCH --nodes=1
  8 #SBATCH --ntasks-per-node=2 --cpus-per-task=20
  9 #SBATCH --gres-flags=enforce-binding
 10 #SBATCH -o %x.o%j
 11 #SBATCH -e %x.e%j
 12 echo Running on hosts
 13 echo Time is `date`
 14 echo Directory is $PWD
 15 echo This job runs on the following nodes:
 16 echo $SLURM_JOB_NODELIST
 17 # Your conda environment
 18 conda_env=PyTorch-190
 19 
 20 #ATTENTION! THIS MUST BE ON ONE LINE, OR IT WILL ERROR!
 21 source ~/.bashrc
 22 
 23 module add cuda/11.1
 24 module add /opt/Modules/python/anaconda3
 25 #module add cudnn/7.6.5.32_cuda10.2
 26 conda activate $conda_env
 27 cd $PWD
 28 
 29 
 30 #Number of processes per node to launch (20 for CPU, 2 for GPU)
 31 NPROC_PER_NODE=2
 32 
 33 #The path where your code is placed
 34 path="/home/chp/bjiangch/zyl/2021_0705/program/eann/"
 35 #This is the command that runs your pytorch script
 36 #You will want to replace this
 37 COMMAND="$path"
 38 
 39 #We want names of master and slave nodes
 40 MASTER=`/bin/hostname -s`
 41 
 42 MPORT=`ss -tan | awk '{print $4}' | cut -d':' -f2 | \
 43       grep "[2-9][0-9]\{3,3\}" | grep -v "[0-9]\{5,5\}" | \
 44       sort | uniq | shuf`
 45 
 46 python -m torch.distributed.run --nproc_per_node=$NPROC_PER_NODE --nnodes=$SLURM_JOB_NUM_NODES --rdzv_id=$SLURM_JOB_ID --rdzv_backend=c10d --rdzv_endpoint=$MASTER:$MPORT $COMMAND >out
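
One difference I noticed in the deprecation notice: torch.distributed.run sets the local rank through the LOCAL_RANK environment variable by default instead of passing a --local_rank argument. On the training-script side the corresponding change would look roughly like this (a sketch only; I assume the rest of the env:// setup stays the same as with torch.distributed.launch):

# Sketch of the rank handling when launching with torch.distributed.run
import os

import torch
import torch.distributed as dist

# torch.distributed.run exports MASTER_ADDR, MASTER_PORT, RANK, WORLD_SIZE and LOCAL_RANK
dist.init_process_group(backend="nccl", init_method="env://")
local_rank = int(os.environ["LOCAL_RANK"])   # replaces the --local_rank argument used by launch
torch.cuda.set_device(local_rank)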

Many thanks for your kind help!