Hi, I am trying to submit a Slurm job for PyTorch DDP. I use Facebook Code for configuring distributed training, and the program just hangs in `torch.distributed.init_process_group`. I submit the Slurm job with `bash slurm.sh`,
where `slurm.sh` is:

```bash
sbatch -c 16 -n 1 --ntasks-per-node 2 --gres gpu:2 python train.py
```
But the job does not hang in `torch.distributed.init_process_group` if I bypass the Slurm part of Facebook Code: launching with

```bash
python -m torch.distributed.launch --nproc_per_node=2 train.py
```

and just calling `torch.distributed.init_process_group(init_method='env://', backend='nccl')` works fine, without running the following Slurm-specific setup:
```python
import os
import subprocess

# Take the first host in the expanded Slurm node list as the master address.
hostnames = subprocess.check_output(
    ['scontrol', 'show', 'hostnames', os.environ['SLURM_JOB_NODELIST']])
params.master_addr = hostnames.split()[0].decode('utf-8')
```
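For reference, this is a minimal sketch of the path that works for me; everything besides the launch command and the `init_process_group` call is placeholder code, not the actual Facebook Code script:

```python
# Launched with: python -m torch.distributed.launch --nproc_per_node=2 train.py
import argparse

import torch
import torch.distributed as dist

parser = argparse.ArgumentParser()
parser.add_argument('--local_rank', type=int, default=0)  # injected by the launcher
args = parser.parse_args()

torch.cuda.set_device(args.local_rank)
# The launcher has already set MASTER_ADDR/MASTER_PORT/RANK/WORLD_SIZE,
# so env:// initialization finds everything it needs and returns.
dist.init_process_group(init_method='env://', backend='nccl')
print(f'rank {dist.get_rank()}/{dist.get_world_size()} initialized')
```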
Does the `scontrol`-based snippet above give a wrong master address, so that `torch.distributed.init_process_group` hangs? Code to reproduce is included in Facebook Code.
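In case it helps to pin down the mismatch, this is my understanding of what `init_method='env://'` needs when the launcher is bypassed; the mapping from Slurm variables is my assumption, not something I have confirmed in Facebook Code:

```python
import os
import subprocess

import torch.distributed as dist

# Assumed Slurm-to-rendezvous wiring (hypothetical sketch, not Facebook Code):
hostnames = subprocess.check_output(
    ['scontrol', 'show', 'hostnames', os.environ['SLURM_JOB_NODELIST']])
os.environ['MASTER_ADDR'] = hostnames.split()[0].decode('utf-8')
os.environ['MASTER_PORT'] = '29500'                    # arbitrary free port
os.environ['RANK'] = os.environ['SLURM_PROCID']        # global rank of this task
os.environ['WORLD_SIZE'] = os.environ['SLURM_NTASKS']  # total number of tasks

# env:// reads MASTER_ADDR, MASTER_PORT, RANK and WORLD_SIZE; if any of them
# is wrong (e.g. WORLD_SIZE larger than the number of processes that actually
# start), this call blocks forever waiting for the missing peers.
dist.init_process_group(init_method='env://', backend='nccl')
```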