Multi-node model parallelism with PyTorch

Hi @mrshenli, I have been trying to solve this, and I think the closest issue anyone has faced is this one: Strange behaviour of GLOO tcp transport

I tried something similar to the code snippets in that issue. My code snippets are below.

On my EC2 instance (rank 0)

import torch.distributed as dist
import sys


if len(sys.argv) < 3:
    raise Exception("please enter host and port")

host = sys.argv[1]
port = sys.argv[2]
init_method = f"tcp://{host}:{port}"
print(f"init_method = {init_method}")

# on rank 0
dist.init_process_group(
    backend="gloo", init_method=init_method, rank=0, world_size=2
)

On my laptop (the rank 1 node, which is a Mac)

import torch.distributed as dist
import sys

if len(sys.argv) < 3:
    raise Exception("please enter host and port")

host = sys.argv[1]
port = sys.argv[2]
init_method = f"tcp://{host}:{port}"
print(f"init_method = {init_method}")

# on rank 1
dist.init_process_group(
    backend="gloo", init_method=init_method, rank=1, world_size=2
)

Nothing complicated at all.

Now, on my EC2 instance (the rank 0 node), I run the script with the following command:

python script.py localhost 50051

On my laptop (the rank 1 node, a Mac), I run the script with the following command:

python script.py <Public IP of EC2 instance> 50051

The moment I do this, I get the following error on my EC2 instance:

File "/home/ubuntu/miniconda3/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 862, in _new_process_group_helper
    pg = ProcessGroupGloo(prefix_store, group_rank, group_size, timeout=timeout)
RuntimeError: [/opt/conda/conda-bld/pytorch_1670525493953/work/third_party/gloo/gloo/transport/tcp/pair.cc:211] address family mismatch

Naturally, on my laptop, I then get a "Connection reset by peer" error.
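Since the message mentions an address family, one thing that can be checked on both machines is whether a given host resolves to IPv4 (AF_INET), IPv6 (AF_INET6), or both. The snippet below is only a diagnostic sketch using the standard library, not a fix. `address_families` is a hypothetical helper name, and the assumption that the mismatch is an IPv4/IPv6 disagreement between the two nodes is unconfirmed.

```python
import socket


def address_families(host, port):
    """Return the sorted names of the address families host:port resolves to."""
    # getaddrinfo returns one tuple per candidate address; element 0 is the family.
    infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    return sorted({socket.AddressFamily(info[0]).name for info in infos})


# Example calls to run on each node (results depend on the machine's resolver):
#   address_families("localhost", 50051)
#   address_families(socket.gethostname(), 50051)
```

If the two machines report different families for the names they use (for example, only AF_INET6 on one and only AF_INET on the other), that would at least be consistent with the error above.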

I do not know what to do. I have tried different values for the host when running the script on the EC2 instance, such as python script.py <Public IP of EC2 instance> 50051 and python script.py <Private IP of EC2 instance> 50051. In every case I get the same error the moment I run the command on my laptop to connect to the EC2 instance.

The error comes from this C++ code: gloo/pair.cc at 4a5e339b764261d20fc409071dc7a8b8989aa195 · facebookincubator/gloo · GitHub. However, the error message doesn't provide any more information about the address family mismatch.

Why is this address family mismatch happening, and is there any way to get around it?