I’ve spent a lot of time trying to get distributed training to work — please help. I’m using the standard ImageNet example from https://github.com/pytorch/examples/tree/master/imagenet
I wrote a bash script to launch the processes on each machine (this snippet reads the eight hosts from a file).
i=0
while read -u 10 host; do
  host=${host%% slots*}   # strip any " slots=..." suffix from the hostfile line
  echo "$host"
  echo "$i"
  ssh -o "StrictHostKeyChecking no" "$host" \
    'tmux new-session -d "/usr/bin/python3 efs/pytorch/pytorch/torch/distributed/launch.py --nproc_per_node=8 --nnodes=7 --master_addr=172.31.36.234 --master_port=1234 pytorch.py --dtype float32 --dummy-data 1281167 --arch resnet50 -b 128 --world-size 8 --dist-backend nccl --dist-url env:// --rank '"$i"' /home/ubuntu/"'
  i=$((i + 1))   # increment after the ssh so ranks start at 0
done 10< "8-hosts"
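For reference, `torch.distributed.launch` also takes a `--node_rank` flag (the index of the machine, 0 through nnodes-1); when nnodes is greater than 1 and node_rank is left at its default of 0, every machine believes it is node 0 and the rendezvous can stall. A hedged sketch of what one node's command conventionally looks like, reusing the paths and flags from the script above (the `$i` here is assumed to be that node's index):

```shell
# Sketch of a single node's launch command (assumes the file layout above).
# launch.py spawns --nproc_per_node worker processes and exports RANK,
# WORLD_SIZE, MASTER_ADDR and MASTER_PORT into each worker's environment.
/usr/bin/python3 efs/pytorch/pytorch/torch/distributed/launch.py \
    --nproc_per_node=8 --nnodes=7 --node_rank="$i" \
    --master_addr=172.31.36.234 --master_port=1234 \
    pytorch.py --dtype float32 --dummy-data 1281167 --arch resnet50 -b 128 \
    --dist-backend nccl --dist-url env:// /home/ubuntu/
```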
When I run this, it just hangs after printing:
pytorch.py:97: UserWarning: You have chosen a specific GPU. This will completely disable data parallelism.
warnings.warn('You have chosen a specific GPU. This will completely '
I’m trying to use 8 p3.16xlarge instances on AWS with the NCCL backend. Please help me debug this. I’ve seen many reports of this issue but no solutions. I’ve read your tutorial on distributed training, but even that doesn’t give an end-to-end example together with the bash command used to start training.
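For what it’s worth, the env:// rendezvous itself can be exercised in isolation, which helps separate networking problems from the training code. A minimal sketch (this is not the author’s pytorch.py; the function name and the backend default are assumptions):

```python
# Minimal sketch of the env:// rendezvous that torch.distributed.launch
# drives -- an illustration of what each worker process must do, not the
# real pytorch.py training script.
import torch.distributed as dist

def init_distributed(backend="nccl"):
    # With init_method="env://", init_process_group reads MASTER_ADDR,
    # MASTER_PORT, RANK and WORLD_SIZE from the environment -- exactly
    # the variables that torch.distributed.launch exports per worker.
    dist.init_process_group(backend=backend, init_method="env://")
    return dist.get_rank(), dist.get_world_size()
```

Running this with a single process and the gloo backend (after setting the four environment variables by hand) is a quick way to confirm that the rendezvous machinery works before bringing NCCL and multiple hosts into the picture.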