How do I use the world-size parameter for DistributedDataParallel in the PyTorch example with multiple GPUs?

I am running this PyTorch example on a g2.2xlarge AWS machine. When I run time python ImageNet2, it runs well with the following timing:

real	3m16.253s
user	1m50.376s
sys	1m0.872s

However, when I add the world-size parameter, the script hangs and does not execute anything. The command is as follows: time python --world-size 2 ImageNet2

So, how do I use the DistributedDataParallel functionality with the world-size parameter in this script? (The world-size parameter is simply the number of distributed processes.)

Do I spin up another similar instance for this purpose? If so, how does the script recognize that instance? Do I need to pass extra parameters, such as the instance's IP address?

[I have also asked this question on StackOverflow, in case someone is willing to help there:]

Hi @Dawny33,

I've recently been playing with distributed PyTorch, so I can give you some pointers here, though I'm not sure whether you've already figured this out.

I'm not sure whether you can create multiple processes on a single machine by running init_process_group on different threads (it works in MPI), but you can try that out and let me know. However, you can definitely run the distributed version of this example on a cluster with more than one EC2 instance.
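For the single-machine case, one thing you could try instead of threads is spawning separate processes, each of which calls init_process_group with its own rank. A minimal sketch (my own, not from the ImageNet example; it assumes the gloo backend and a free local port 23457):

```python
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(rank, world_size):
    # Each spawned process joins the same group; gloo works on CPU.
    dist.init_process_group(
        backend="gloo",
        init_method="tcp://127.0.0.1:23457",  # assumption: port 23457 is free
        rank=rank,
        world_size=world_size,
    )
    t = torch.ones(1) * (rank + 1)
    dist.all_reduce(t)  # default op is SUM: 1 + 2 = 3 for two processes
    print(f"rank {rank}: {t.item()}")
    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = 2
    # Launches world_size processes, passing the process index as the first arg.
    mp.spawn(worker, args=(world_size,), nprocs=world_size)
```

Each process ends up with the summed tensor, which is a quick way to confirm the group actually formed.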

To do that, the first thing you need is to set up your own cluster on EC2, which means every worker node must be SSH-able from the master node. Here is a good tutorial for doing this ( Then, if you look at the example code, it uses TCP to initialize the cluster, so you will need both to set init_method to the private address of your master node (with a self-chosen port, e.g. 23456) and to set the rank for each node in your cluster, as described here (
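Concretely, every node runs the same initialization call with the same master address and world size but its own rank. A hedged sketch (the helper names and the 172.31.0.10 address are my own placeholders, not from the example):

```python
import torch.distributed as dist

def tcp_init_method(master_addr, port=23456):
    # assumption: 23456 is a free port on the master node
    return f"tcp://{master_addr}:{port}"

def init_node(rank, world_size, master_addr):
    # Every node calls this with the same master_addr and world_size,
    # but its own rank (0 on the master, 1..N-1 on the workers).
    # The call blocks until all world_size processes have connected.
    dist.init_process_group(
        backend="gloo",  # or "nccl" for GPU tensors
        init_method=tcp_init_method(master_addr),
        rank=rank,
        world_size=world_size,
    )
```

So on the master you would call init_node(0, 2, "172.31.0.10") and on the worker init_node(1, 2, "172.31.0.10"), where 172.31.0.10 stands in for your master's private IP.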

After setting all of this up, run the code on each node (you may want to write a script to simplify this step), and you should get what you want.
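The per-node launch script could be as simple as generating one command line per rank. A dry-run sketch (main.py, the --dist-url/--rank flag names, and the host list are assumptions; adapt them to the actual example's argument names):

```python
def launch_commands(hosts, master_addr, port=23456):
    # One command per node: same master address and world size,
    # a distinct rank for each host in the list.
    world_size = len(hosts)
    return [
        f"python main.py --dist-url tcp://{master_addr}:{port} "
        f"--world-size {world_size} --rank {rank}"
        for rank in range(world_size)
    ]

if __name__ == "__main__":
    # assumption: two nodes, master at 172.31.0.10
    for host, cmd in zip(["172.31.0.10", "172.31.0.11"],
                         launch_commands(["172.31.0.10", "172.31.0.11"],
                                         "172.31.0.10")):
        print(f"on {host}: {cmd}")  # run each command on its node, e.g. over SSH
```

Since the master's SSH key is already on the workers, you could pipe each printed command through ssh to start all nodes from the master.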

Hope this helps.