I’m trying to play with the distributed PyTorch package with the MPI backend, but I’ve run into a problem.
I configured four AWS EC2 instances (one as master, the others as workers) and made the worker nodes ssh-able from the master node. Then, on each node, I simply ran the following code:
import torch.distributed as dist
dist.init_process_group(backend='mpi', world_size=4)
print('Hello from process {} (out of {})!'.format(dist.get_rank(), dist.get_world_size()))
And on each node, I got output like:
Hello from process 0 (out of 1)!
Can anyone tell me how to set the init_method argument of init_process_group so that this simple test works? The expected output is something like:
Hello from process 0 (out of 4)!
Hello from process 1 (out of 4)!
Hello from process 2 (out of 4)!
Hello from process 3 (out of 4)!
Thanks a lot for this pointer, it’s really helpful. Btw, I noticed that the MPI initialization part of the distributed PyTorch docs is kind of “missing”. Do you need someone to help with that? If so, I can do it.
Yes, it was somewhat assumed that people planning to use MPI already knew what they were doing, but a PR that fills out the docs to match the other init methods would definitely be helpful!
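For anyone who hits this before the docs are updated: with the MPI backend, the rank and world size come from the MPI launcher itself, so you don’t pass init_method or world_size at all. A minimal sketch (assuming a PyTorch build compiled with MPI support):

import torch.distributed as dist

# mpirun sets up the MPI environment; rank and world size are
# discovered from it, so no init_method/world_size is needed.
dist.init_process_group(backend='mpi')
print('Hello from process {} (out of {})!'.format(
    dist.get_rank(), dist.get_world_size()))

Launched under mpirun with 4 processes, each process should then print its own rank out of 4, instead of every node reporting a world of size 1.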
How to run an MPI backend program across nodes? Your answer looks like it’s for a single node. I want to run the program across multiple nodes, and your answer doesn’t work for that.
Basically, you just need to follow the standard way of launching an MPI job across multiple hosts.
Here is a good tutorial for that: http://mpitutorial.com/tutorials/running-an-mpi-cluster-within-a-lan/
It basically tells you to prepare a hostfile with the private addresses of all the nodes in your cluster. Then, when you launch the job from the master node, you pass the path to that hostfile.
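For example, something like this (the IPs and script name are made up, and the hostfile syntax shown is Open MPI’s; MPICH’s format differs slightly):

# hostfile: one line per node, using each node's private IP
172.31.0.11 slots=1
172.31.0.12 slots=1
172.31.0.13 slots=1
172.31.0.14 slots=1

Then, from the master node:

mpirun -np 4 --hostfile /path/to/hostfile python test.py

Each of the 4 processes then calls dist.init_process_group(backend='mpi') and picks up its rank and world size from MPI, which is what produces the expected “out of 4” output.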