I’ve spent a lot of time trying to get distributed training to work — please help. I’m using the standard ImageNet example from https://github.com/pytorch/examples/tree/master/imagenet
I wrote a bash script to launch the processes on each machine (this snippet reads the eight hosts from a file).
i=0
while read -u 10 host; do
  host=${host%% slots*}   # strip any " slots=..." suffix from the hostfile line
  echo "$host"
  echo "$i"
  ssh -o "StrictHostKeyChecking no" "$host" \
    'tmux new-session -d "/usr/bin/python3 efs/pytorch/pytorch/torch/distributed/launch.py --nproc_per_node=8 --nnodes=7 --master_addr=172.31.36.234 --master_port=1234 pytorch.py --dtype float32 --dummy-data 1281167 --arch resnet50 -b 128 --world-size 8 --dist-backend nccl --dist-url env:// --rank '"$i"' /home/ubuntu/"'
  i=$((i + 1))   # increment after the ssh so ranks start at 0
done 10< "8-hosts"
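For reference, `torch.distributed.launch` also takes a `--node_rank` flag (the index of the machine, 0 through nnodes-1); when nnodes is greater than 1 and node_rank is left at its default of 0, every machine believes it is node 0 and the rendezvous can stall. A hedged sketch of what one node's command conventionally looks like, reusing the paths and flags from the script above (the `$i` here is assumed to be that node's index):

```shell
# Sketch of a single node's launch command (assumes the file layout above).
# launch.py spawns --nproc_per_node worker processes and exports RANK,
# WORLD_SIZE, MASTER_ADDR and MASTER_PORT into each worker's environment.
/usr/bin/python3 efs/pytorch/pytorch/torch/distributed/launch.py \
    --nproc_per_node=8 --nnodes=7 --node_rank="$i" \
    --master_addr=172.31.36.234 --master_port=1234 \
    pytorch.py --dtype float32 --dummy-data 1281167 --arch resnet50 -b 128 \
    --dist-backend nccl --dist-url env:// /home/ubuntu/
```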
When I run this, it just hangs after printing:
pytorch.py:97: UserWarning: You have chosen a specific GPU. This will completely disable data parallelism.
warnings.warn('You have chosen a specific GPU. This will completely '
I’m trying to use 8 p3.16xlarge instances on AWS with the NCCL backend. Please help me debug this. I’ve seen many reports of this issue but no solutions. I’ve read your tutorial on distributed training, but even that doesn’t give an end-to-end example together with the bash command used to start training.
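For what it’s worth, the env:// rendezvous itself can be exercised in isolation, which helps separate networking problems from the training code. A minimal sketch (this is not the author’s pytorch.py; the function name and the backend default are assumptions):

```python
# Minimal sketch of the env:// rendezvous that torch.distributed.launch
# drives -- an illustration of what each worker process must do, not the
# real pytorch.py training script.
import torch.distributed as dist

def init_distributed(backend="nccl"):
    # With init_method="env://", init_process_group reads MASTER_ADDR,
    # MASTER_PORT, RANK and WORLD_SIZE from the environment -- exactly
    # the variables that torch.distributed.launch exports per worker.
    dist.init_process_group(backend=backend, init_method="env://")
    return dist.get_rank(), dist.get_world_size()
```

Running this with a single process and the gloo backend (after setting the four environment variables by hand) is a quick way to confirm that the rendezvous machinery works before bringing NCCL and multiple hosts into the picture.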