Hi
I’m experiencing an issue where distributed models using torch.distributed.launch and DistributedDataParallel hang specifically for NCCL multi-node multi-GPU training, but work fine for single-node multi-GPU and multi-node single-GPU training. Has anyone else experienced this?
In the multi-node multi-GPU case, all GPUs are loaded with models (nvidia-smi reports GPU memory utilisation), but on reaching DistributedDataParallel, NCCL_DEBUG reports
“SECONDARY_ADDR:6582:6894 [1] NCCL INFO Call to connect returned Connection timed out, retrying
SECONDARY_ADDR:6581:6895 [0] NCCL INFO Call to connect returned Connection timed out, retrying” on the rank 1 variant running on device SECONDARY_ADDR.
For both single-node/multi-GPU and multi-node/single-GPU, however, the code proceeds past DistributedDataParallel without any issues, which is what makes this particularly perplexing.
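Since the log shows a connection timeout, one thing that can be checked independently of NCCL is whether plain TCP connections between the nodes succeed. The following is only a minimal sketch: `can_connect` is a hypothetical helper, and the host and port in the usage line are placeholders. NCCL opens its own sockets, so a successful check here does not prove that NCCL's connections will work.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


if __name__ == "__main__":
    # Placeholder values: substitute the peer node's address and a port
    # on which something is known to be listening on that node.
    print(can_connect("127.0.0.1", 29500))
```

Running this from each node against the other, in both directions, would at least show whether the timeout is specific to NCCL or affects ordinary sockets too.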
The job is run via Slurm using torch 1.8.1+cu111 and nccl/2.8.3-cuda-11.1.1.
Key implementation details are as follows.
The batch script used to run the code has the key details:
export NPROCS_PER_NODE=2 # GPUs per node
export WORLD_SIZE=2 # Total nodes (total ranks = GPUs per node * WORLD_SIZE)
…
RANK=0
for node in $HOSTLIST; do
ssh $node "
module load nccl/2.8.3-cuda-11.1.1
python3 -m torch.distributed.launch --nproc_per_node=$NPROCS_PER_NODE --nnodes=$WORLD_SIZE --node_rank=$RANK --master_addr=$MASTER_ADDR --master_port=$MASTER_PORT test.py > test_$RANK" &
RANK=$((RANK+1))
done
wait
The above is the multi-node multi-gpu configuration. For single-node multi-gpu it is modified so that NPROCS_PER_NODE=2, WORLD_SIZE=1; while multi-node single gpu is NPROCS_PER_NODE=1, WORLD_SIZE=2.
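To make the three configurations concrete, here is a small sketch of the rank layout each one should produce. It assumes the launcher assigns the global rank as node_rank * nproc_per_node + local_rank, which is my understanding of torch.distributed.launch; `rank_layout` is a hypothetical helper and not part of PyTorch.

```python
def rank_layout(nproc_per_node: int, nnodes: int):
    """Return (total_ranks, [(node_rank, local_rank, global_rank), ...])."""
    total = nproc_per_node * nnodes
    layout = [
        # Assumed formula for the global rank of each launched process.
        (node_rank, local_rank, node_rank * nproc_per_node + local_rank)
        for node_rank in range(nnodes)
        for local_rank in range(nproc_per_node)
    ]
    return total, layout


for name, nproc, nodes in [
    ("multi-node multi-GPU", 2, 2),
    ("single-node multi-GPU", 2, 1),
    ("multi-node single-GPU", 1, 2),
]:
    total, layout = rank_layout(nproc, nodes)
    print(f"{name}: {total} ranks -> {layout}")
```

Under that assumption, only the failing configuration has four ranks, with two processes on each node needing to reach processes on the other node.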
Key details of test.py are
import argparse
import torch
parser = argparse.ArgumentParser(formatter_class=argparse.ArgumentDefaultsHelpFormatter)
parser.add_argument("--local_rank", type=int, help="Local rank. Necessary for using the torch.distributed.launch utility.")
arg = parser.parse_args()
local_rank = arg.local_rank
torch.cuda.set_device(arg.local_rank)
torch.distributed.init_process_group(backend='nccl', init_method='env://')
…
model = model.cuda() #to(device)
ddp_model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[local_rank], output_device=local_rank)
…
train_sampler = DistributedSampler(dataset=train_set)
…
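Because init_method='env://' reads its rendezvous settings from environment variables, a quick sanity check is to print what each process actually sees before init_process_group is called. This is a sketch only: `rendezvous_env` is a hypothetical helper, and the four keys are the ones I understand env:// to use.

```python
import os

# Variables that the env:// init method is expected to read.
ENV_KEYS = ("MASTER_ADDR", "MASTER_PORT", "RANK", "WORLD_SIZE")


def rendezvous_env(environ=None):
    """Return the rendezvous variables, marking any that are missing."""
    environ = os.environ if environ is None else environ
    return {key: environ.get(key, "<unset>") for key in ENV_KEYS}


# Print once per process, e.g. near the top of the training script.
print(rendezvous_env())
```

Comparing the output across all processes would show whether every rank agrees on the master address and port, and whether the WORLD_SIZE seen inside the script is the total number of ranks rather than the node count exported in the batch script.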
While torch.distributed.launch has recently been deprecated and replaced with elastic_launch, moving to elastic_launch as a potential solution does not seem viable, because it depends on etcd, which I’m unable to install due to access privilege restrictions.
If anyone has any suggestions about how to resolve this, I would greatly appreciate your input.
Thanks