Hi, I am trying to submit a Slurm job for PyTorch DDP. I use Facebook Code for configuring distributed training, and the program just hangs in `torch.distributed.init_process_group`. I submit the Slurm job with `bash slurm.sh`,
where `slurm.sh` is:

```bash
sbatch -c 16 -n 1 --ntasks-per-node 2 --gres gpu:2 python train.py
```
But the job does not hang in `torch.distributed.init_process_group` if I bypass the Slurm part of Facebook Code: launching with

```bash
python -m torch.distributed.launch --nproc_per_node=2 train.py
```

and just calling `torch.distributed.init_process_group(init_method='env://', backend='nccl')` works fine, without running the following Slurm-specific setup:
```python
import os
import subprocess

# Take the first host in the expanded Slurm node list as the master address.
hostnames = subprocess.check_output(
    ['scontrol', 'show', 'hostnames', os.environ['SLURM_JOB_NODELIST']])
params.master_addr = hostnames.split()[0].decode('utf-8')
```
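For reference, this is a minimal sketch of the path that works for me; everything besides the launch command and the `init_process_group` call is placeholder code, not the actual Facebook Code script:

```python
# Launched with: python -m torch.distributed.launch --nproc_per_node=2 train.py
import argparse

import torch
import torch.distributed as dist

parser = argparse.ArgumentParser()
parser.add_argument('--local_rank', type=int, default=0)  # injected by the launcher
args = parser.parse_args()

torch.cuda.set_device(args.local_rank)
# The launcher has already set MASTER_ADDR/MASTER_PORT/RANK/WORLD_SIZE,
# so env:// initialization finds everything it needs and returns.
dist.init_process_group(init_method='env://', backend='nccl')
print(f'rank {dist.get_rank()}/{dist.get_world_size()} initialized')
```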
Does the `scontrol`-based snippet above give a wrong master address, so that `torch.distributed.init_process_group` hangs? Code to reproduce is included in Facebook Code.
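In case it helps to pin down the mismatch, this is my understanding of what `init_method='env://'` needs when the launcher is bypassed; the mapping from Slurm variables is my assumption, not something I have confirmed in Facebook Code:

```python
import os
import subprocess

import torch.distributed as dist

# Assumed Slurm-to-rendezvous wiring (hypothetical sketch, not Facebook Code):
hostnames = subprocess.check_output(
    ['scontrol', 'show', 'hostnames', os.environ['SLURM_JOB_NODELIST']])
os.environ['MASTER_ADDR'] = hostnames.split()[0].decode('utf-8')
os.environ['MASTER_PORT'] = '29500'                    # arbitrary free port
os.environ['RANK'] = os.environ['SLURM_PROCID']        # global rank of this task
os.environ['WORLD_SIZE'] = os.environ['SLURM_NTASKS']  # total number of tasks

# env:// reads MASTER_ADDR, MASTER_PORT, RANK and WORLD_SIZE; if any of them
# is wrong (e.g. WORLD_SIZE larger than the number of processes that actually
# start), this call blocks forever waiting for the missing peers.
dist.init_process_group(init_method='env://', backend='nccl')
```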