I’m trying to play with the distributed PyTorch package with the MPI backend, but I’ve run into a problem.
I configured four AWS EC2 instances (one as master, the others as workers) and made the worker nodes ssh-able from the master node. Then, on each node, I simply ran the following code:
import torch.distributed as dist
dist.init_process_group(backend='mpi', world_size=4)
print('Hello from process {} (out of {})!'.format(dist.get_rank(), dist.get_world_size()))
And on each node, I got output like:
Hello from process 0 (out of 1)!
Can anyone tell me how to set the init_method argument of init_process_group so that this simple test works? The expected output is something like:
Hello from process 0 (out of 4)!
Hello from process 1 (out of 4)!
Hello from process 2 (out of 4)!
Hello from process 3 (out of 4)!
Thanks a lot for this pointer, it’s really helpful. Btw, I noticed that the MPI initialization part of the distributed PyTorch docs is kind of “missing”. Do you need someone to help with that? If so, I can do it.
Yes, it was somewhat assumed that people planning to use MPI already knew what they were doing, but a PR that fills out the docs to match the other init methods would definitely be helpful!
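For anyone who hits this before the docs are updated: with the MPI backend, the rank and world size come from the MPI launcher itself, so you don’t pass init_method or world_size at all. A minimal sketch (assuming a PyTorch build compiled with MPI support):

import torch.distributed as dist

# mpirun sets up the MPI environment; rank and world size are
# discovered from it, so no init_method/world_size is needed.
dist.init_process_group(backend='mpi')
print('Hello from process {} (out of {})!'.format(
    dist.get_rank(), dist.get_world_size()))

Launched under mpirun with 4 processes, each process should then print its own rank out of 4, instead of every node reporting a world of size 1.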
How to run an MPI backend program across nodes? Your answer looks like it’s for a single node. I want to run the program across multiple nodes, and your answer doesn’t work for that.
Basically, you just need to follow the standard way of launching an MPI job across multiple hosts.
Here is a good tutorial for that: http://mpitutorial.com/tutorials/running-an-mpi-cluster-within-a-lan/
It basically tells you to prepare a hostfile with the private addresses of all the nodes in your cluster. Then, when you launch the job from the master node, you pass the path to that hostfile.
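For example, something like this (the IPs and script name are made up, and the hostfile syntax shown is Open MPI’s; MPICH’s format differs slightly):

# hostfile: one line per node, using each node's private IP
172.31.0.11 slots=1
172.31.0.12 slots=1
172.31.0.13 slots=1
172.31.0.14 slots=1

Then, from the master node:

mpirun -np 4 --hostfile /path/to/hostfile python test.py

Each of the 4 processes then calls dist.init_process_group(backend='mpi') and picks up its rank and world size from MPI, which is what produces the expected “out of 4” output.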