I'm new to PyTorch distributed and I want to connect two machines on the same network over TCP, but nothing happens when they try to connect; it just hangs.
The code on the master machine (IP: 192.168.43.51):
import os
import torch
import torch.distributed as dist

def setup():
    os.environ['MASTER_ADDR'] = '192.168.43.51'
    os.environ['MASTER_PORT'] = '8000'
    os.environ['RANK'] = '0'
    os.environ['WORLD_SIZE'] = '2'
    # initialize the process group
    dist.init_process_group(backend='gloo', init_method='tcp://192.168.43.51:8000', rank=0, world_size=2)
    print("Connected ...")

if __name__ == "__main__":
    setup()
The execution command is:
python3 distributed_master.py
The code on the machine that tries to connect to the master (IP: 192.168.43.240):
import os
import torch
import torch.distributed as dist

def setup():
    os.environ['MASTER_ADDR'] = '192.168.43.51'
    os.environ['MASTER_PORT'] = '8000'
    os.environ['RANK'] = '1'
    os.environ['WORLD_SIZE'] = '2'
    # initialize the process group
    dist.init_process_group(backend='gloo', init_method='tcp://192.168.43.51:8000', rank=1, world_size=2)
    print("Connected ...")

if __name__ == "__main__":
    setup()
The execution command is:
python3 distributed_client.py
I run the master first and then the client, but there is no response on either side; both seem to hang inside the init_process_group call.
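To rule out a plain network problem, a check along these lines could be run on the client while the master script is waiting. This is only a rough sketch: the helper name can_connect is made up for illustration, and it only tests whether a TCP connection to the master's address and port can be opened at all. It says nothing about whether gloo itself will complete the rendezvous.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port opens within `timeout` seconds.

    Hypothetical helper for illustration; not part of PyTorch.
    """
    try:
        # create_connection raises OSError (refused, unreachable, timed out) on failure
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Calling can_connect("192.168.43.51", 8000) from the client should return True while the master is waiting in init_process_group. If it returns False, a firewall or routing problem between the two machines would be the first thing to look at.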
Can someone help me?
Thanks!