Problem connecting two machines using PyTorch distributed

I'm new to PyTorch distributed and I want to connect two machines on the same network over TCP, but nothing happens when I try to connect; it just hangs.

The code on the master machine (IP: 192.168.43.51):

import os
import torch
import torch.distributed as dist

def setup():
    os.environ['MASTER_ADDR'] = '192.168.43.51'
    os.environ['MASTER_PORT'] = '8000'
    os.environ['RANK'] = '0'
    os.environ['WORLD_SIZE'] = '2'

    # initialize the process group
    dist.init_process_group(backend='gloo', init_method='tcp://192.168.43.51:8000', rank=0, world_size=2)
    print("Connected ...")

if __name__ == "__main__":
    setup()

The execution command is:

python3 distributed_master.py

The code on the machine that tries to connect to the master (IP: 192.168.43.240):

import os
import torch
import torch.distributed as dist

def setup():
    os.environ['MASTER_ADDR'] = '192.168.43.51'
    os.environ['MASTER_PORT'] = '8000'
    os.environ['RANK'] = '1'
    os.environ['WORLD_SIZE'] = '2'

    # initialize the process group
    dist.init_process_group(backend='gloo', init_method='tcp://192.168.43.51:8000', rank=1, world_size=2)
    print("Connected ...")

if __name__ == "__main__":
    setup()

The execution command is:

python3 distributed_client.py

I start the master first and then the client, but neither prints anything. Both seem to hang inside init_process_group.

Can someone help me?
Thanks!

Try setting your init_method to env://, which tells each process to read the master address, port, rank, and world size from the environment variables in order to find the other processes.
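For reference, here is a minimal sketch of what the env:// variant could look like. The helper names rendezvous_env and init_from_env are made up for this example, and the 60-second timeout is an assumption added so that a failed rendezvous should raise an error sooner than the default timeout allows; it is not part of the original scripts. The address and port are the ones from the question.

```python
import datetime
import os


def rendezvous_env(master_addr, master_port, rank, world_size):
    """Build the four variables that init_method='env://' reads (hypothetical helper)."""
    return {
        "MASTER_ADDR": master_addr,
        "MASTER_PORT": str(master_port),
        "RANK": str(rank),
        "WORLD_SIZE": str(world_size),
    }


def init_from_env(timeout_seconds=60):
    """Join the process group using only the environment variables (hypothetical helper)."""
    # Imported lazily so the helper above can be used without PyTorch installed.
    import torch.distributed as dist

    dist.init_process_group(
        backend="gloo",
        init_method="env://",
        # Assumption: a shorter timeout than the default, chosen for debugging.
        timeout=datetime.timedelta(seconds=timeout_seconds),
    )


# On the master (rank 0); the client would pass rank=1 instead:
#   os.environ.update(rendezvous_env("192.168.43.51", 8000, rank=0, world_size=2))
#   init_from_env()
#   print("Connected ...")
```

With env://, rank and world_size do not need to be passed to init_process_group, since they are taken from RANK and WORLD_SIZE.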

I tried your answer but the result is the same.

Hi, I solved my problem by setting the following environment variables on both the master and the client:

Master node:

- GLOO_SOCKET_IFNAME = <master network interface name>
- TP_SOCKET_IFNAME = <master network interface name>

The following rule must also be added to the master's firewall: sudo ufw allow from <client IP>

Client node:

- GLOO_SOCKET_IFNAME = <client network interface name>
- TP_SOCKET_IFNAME = <client network interface name>

The following rule must also be added to the client's firewall: sudo ufw allow from <master IP>
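In case it helps anyone else, the setup above can be sketched in Python roughly as follows. The function names list_interfaces, pin_interface, and can_reach are made up for this example, and the interface name "wlan0" in the usage comment is only a placeholder. The variables must be set before init_process_group is called. GLOO_SOCKET_IFNAME is the one the gloo backend reads; TP_SOCKET_IFNAME (note the spelling) is read by the TensorPipe RPC backend, so it probably only matters if you also use torch.distributed.rpc.

```python
import os
import socket


def list_interfaces():
    """Return the network interface names visible on this machine."""
    return [name for _index, name in socket.if_nameindex()]


def pin_interface(ifname, env=None):
    """Set the interface variables in os.environ (or in the mapping passed in)."""
    env = os.environ if env is None else env
    env["GLOO_SOCKET_IFNAME"] = ifname
    env["TP_SOCKET_IFNAME"] = ifname
    return env


def can_reach(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, timed out, or unreachable (for example, blocked by a firewall).
        return False


# Example usage on the client, before calling dist.init_process_group:
#   print(list_interfaces())   # pick the interface on the 192.168.43.x network
#   pin_interface("wlan0")     # placeholder name, use your own
#   print(can_reach("192.168.43.51", 8000))
```

Note that can_reach only returns True while the master is already waiting inside init_process_group, since nothing listens on the port before that. If it returns False while the master is running, the firewall rule is the first thing to check.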