Hi everyone, I am trying to train with DistributedDataParallel. Thanks to the great work of the PyTorch team, it achieves very high efficiency, and everything is fine when the model is trained on a single node. However, when I try to use multiple nodes in one job script, all the processes end up on the host node and the slave node has no processes running on it. Here is my script for the PBS workload manager:
#!/bin/sh
#PBS -V
#PBS -q gpu
#PBS -N test_1e4_T=1
#PBS -l nodes=2:ppn=2
source /share/home/bjiangch/group-zyl/.bash_profile
conda activate Pytorch-181
cd $PBS_O_WORKDIR
path="/share/home/bjiangch/group-zyl/zyl/pytorch/multi-GPU/program/eann/"
#Number of processes per node to launch
NPROC_PER_NODE=2
#Number of processes on all nodes
WORLD_SIZE=`expr $PBS_NUM_NODES \* $NPROC_PER_NODE`
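#e.g. nodes=2:ppn=2 with NPROC_PER_NODE=2 gives WORLD_SIZE=4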
MASTER=`/bin/hostname -s`
cat $PBS_NODEFILE>nodelist
#Make sure this node (MASTER) comes first
SLAVES=`cat nodelist | grep -v $MASTER | uniq`
#We want names of master and slave nodes
HOSTLIST="$MASTER $SLAVES"
#The path above is where the training code lives
#This is the command that runs your pytorch script
#You will want to replace it with your own
COMMAND="$path --world_size=$WORLD_SIZE"
#Get a random unused port on this host(MASTER)
#First line gets list of unused ports
#3rd line gets single random port from the list
MPORT=`ss -tan | awk '{print $5}' | cut -d':' -f2 | \
grep "[2-9][0-9]\{3,3\}" | sort | uniq | shuf -n 1`
#Launch the pytorch processes, first on master (first in $HOSTLIST) then on the slaves
RANK=0
for node in $HOSTLIST; do
        ssh -q $node
                python3 -m torch.distributed.launch \
                --nproc_per_node=$NPROC_PER_NODE \
                --nnodes=$PBS_NUM_NODES \
                --node_rank=$RANK \
                --master_addr="$MASTER" --master_port="$MPORT" \
                $COMMAND &
        RANK=$((RANK+1))
done
wait
It is adapted from the example given here.
I want to submit a 4-process job (2 nodes with 2 processes per node).
For validation, I manually ssh to each node from the login node and execute the following:
ssh gpu1
python3 -m torch.distributed.launch --nnodes=2 --node_rank=0
ssh gpu2
python3 -m torch.distributed.launch --nnodes=2 --node_rank=1
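These are abbreviated; the full commands I type look roughly like the following, where gpu1 acts as the master and 29500 stands for whichever free port I pick:
#on gpu1 (node_rank 0, master)
python3 -m torch.distributed.launch --nproc_per_node=2 --nnodes=2 --node_rank=0 \
        --master_addr="gpu1" --master_port=29500 \
        /share/home/bjiangch/group-zyl/zyl/pytorch/multi-GPU/program/eann/ --world_size=4
#on gpu2 (node_rank 1)
python3 -m torch.distributed.launch --nproc_per_node=2 --nnodes=2 --node_rank=1 \
        --master_addr="gpu1" --master_port=29500 \
        /share/home/bjiangch/group-zyl/zyl/pytorch/multi-GPU/program/eann/ --world_size=4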
This works and gives pretty good parallel efficiency. The same problem also occurs on another cluster with a Slurm workload manager. I don't see any difference between the manual run and the job script that should lead to such totally different results. Any suggestions are welcome.
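For reference, my Slurm script follows the same pattern; only the way the node list and node count are obtained differs. This is a rough sketch rather than the exact script:
#Slurm counterpart of the node-list setup (rough sketch)
NPROC_PER_NODE=2
WORLD_SIZE=`expr $SLURM_JOB_NUM_NODES \* $NPROC_PER_NODE`
MASTER=`/bin/hostname -s`
#scontrol expands the compressed node list into one hostname per line
SLAVES=`scontrol show hostnames $SLURM_JOB_NODELIST | grep -v $MASTER`
HOSTLIST="$MASTER $SLAVES"
#the ssh / torch.distributed.launch loop is the same as in the PBS script above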
And the final error is below.
Each of the four worker processes prints the same traceback (the copies are interleaved in the job log); a single copy reads:
Traceback (most recent call last):
  File "/share/home/bjiangch/group-zyl/.conda/envs/Pytorch-181/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/share/home/bjiangch/group-zyl/.conda/envs/Pytorch-181/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/share/home/bjiangch/group-zyl/zyl/pytorch/multi-GPU/program/eann/__main__.py", line 1, in <module>
    import run.train
  File "/share/home/bjiangch/group-zyl/zyl/pytorch/multi-GPU/program/eann/run/train.py", line 70, in <module>
    Prop_class = DDP(Prop_class, device_ids=[local_rank], output_device=local_rank)
  File "/share/home/bjiangch/group-zyl/.conda/envs/Pytorch-181/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 446, in __init__
    self._sync_params_and_buffers(authoritative_rank=0)
  File "/share/home/bjiangch/group-zyl/.conda/envs/Pytorch-181/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 457, in _sync_params_and_buffers
    self._distributed_broadcast_coalesced(
  File "/share/home/bjiangch/group-zyl/.conda/envs/Pytorch-181/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1155, in _distributed_broadcast_coalesced
    dist._broadcast_coalesced(
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1616554793803/work/torch/lib/c10d/ProcessGroupNCCL.cpp:825, invalid usage, NCCL version 2.7.8
ncclInvalidUsage: This usually reflects invalid usage of NCCL library (such as too many async ops, too many collectives at once, mixing streams in a group, etc).
Each torch.distributed.launch instance then exits with the following traceback (it appears twice in the log, once per launcher):
Traceback (most recent call last):
  File "/share/home/bjiangch/group-zyl/.conda/envs/Pytorch-181/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/share/home/bjiangch/group-zyl/.conda/envs/Pytorch-181/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/share/home/bjiangch/group-zyl/.conda/envs/Pytorch-181/lib/python3.8/site-packages/torch/distributed/launch.py", line 340, in <module>
    main()
  File "/share/home/bjiangch/group-zyl/.conda/envs/Pytorch-181/lib/python3.8/site-packages/torch/distributed/launch.py", line 326, in main
    sigkill_handler(signal.SIGTERM, None) # not coming back
  File "/share/home/bjiangch/group-zyl/.conda/envs/Pytorch-181/lib/python3.8/site-packages/torch/distributed/launch.py", line 301, in sigkill_handler
    raise subprocess.CalledProcessError(returncode=last_return_code, cmd=cmd)
subprocess.CalledProcessError: Command '['/share/home/bjiangch/group-zyl/.conda/envs/Pytorch-181/bin/python3', '-u', '/share/home/bjiangch/group-zyl/zyl/pytorch/multi-GPU/program/eann/', '--local_rank=1', '--world_size=4']' returned non-zero exit status 1.