Using rpc on two computers

Hi.
I’m trying to use RPC across two computers, following this example:
https://pytorch.org/tutorials/intermediate/rpc_async_execution.html#batch-updating-parameter-server

First I tried a very simple example, but I got an error.
On computer1 (let’s say, ip address: 1.1.1.1)

import os
import torch.distributed.rpc as rpc

os.environ['MASTER_ADDR'] = '2.2.2.2'
os.environ['MASTER_PORT'] = '7271'
rpc.init_rpc("worker1", rank=1, world_size=2)
rpc.shutdown()

On computer2 (let’s say, ip address : 2.2.2.2)

import os
import torch.distributed.rpc as rpc

os.environ['MASTER_ADDR'] = 'localhost'
os.environ['MASTER_PORT'] = '7271'
rpc.init_rpc("worker0", rank=0, world_size=2)
rpc.shutdown()

Error:

RuntimeError: […/third_party/gloo/gloo/transport/tcp/pair.cc:799] connect [127.0.1.1]:14237: Connection refused

This looks like the same error as the one in the following link,

but it seems that the problem has not been solved.

Is it necessary to set up additional network settings or include options in the code?

=======================================
On both computer1 and computer2:

$ ifconfig
eno1: ~~~
lo: ~~~

# I did this.
$ sudo ufw allow 7271

Hi @pajenk, a couple of questions:

  1. On computer2 (ip address: 2.2.2.2), have you also tried setting MASTER_ADDR to 2.2.2.2? e.g.
import os
import torch.distributed.rpc as rpc

os.environ['MASTER_ADDR'] = '2.2.2.2' # instead of localhost
os.environ['MASTER_PORT'] = '7271'
rpc.init_rpc("worker0", rank=0, world_size=2)
rpc.shutdown()
  2. A Gloo error means the failure is likely happening during init_rpc. Which version of PyTorch are you using?

  3. I'm not sure about additional network settings, but the main thing is to confirm that the network address and port are reachable from each computer.
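
One way to check the reachability point above is a plain TCP connection test, independent of PyTorch. This is only a sketch: `can_connect` is a hypothetical helper name, and the address and port in the comment are the placeholder values from this thread. The rank 0 script has to be running (and waiting in init_rpc) so that something is listening on the port.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) succeeds within `timeout` seconds."""
    try:
        # create_connection resolves the host and attempts a TCP handshake
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts
        return False


# Example call from computer1 (placeholder values from this thread):
# print(can_connect("2.2.2.2", 7271))
```

If this returns False while the rank 0 script is waiting, the problem is probably at the network or firewall level rather than in the RPC code. Note that it only tests MASTER_PORT, while the error above refers to a different port (14237), so a True result does not rule out a network issue.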

Hi @H-Huang, thanks for the questions!

  1. Yes, I tried that, but the same error occurred.
  2. I tried ‘1.11.0+cu113’ and ‘1.11.0+cu102’ on both computers, each in a virtual environment (created with python3 -m venv ~).
    Does the Python version need to be the same? The computers have different versions (3.8 and 3.9).
  3. When I run the code on only one of the two computers, there is no error (it just waits). When I then run the other script on the other computer, the error occurs. So I think they are connecting somehow.
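
One more observation: the address in the error message, 127.0.1.1, is a loopback address rather than 1.1.1.1 or 2.2.2.2. Assuming this is related to how each machine's hostname resolves (an assumption, not something confirmed in this thread), the following sketch prints what the local hostname resolves to. The function names are hypothetical helpers, not part of PyTorch.

```python
import ipaddress
import socket


def is_loopback(address):
    """Return True if `address` is a loopback IP such as 127.0.0.1 or 127.0.1.1."""
    return ipaddress.ip_address(address).is_loopback


def describe_local_hostname():
    """Return (hostname, resolved address or None, loopback flag or None) for this machine."""
    hostname = socket.gethostname()
    try:
        address = socket.gethostbyname(hostname)
    except OSError:
        # The hostname could not be resolved at all
        return hostname, None, None
    return hostname, address, is_loopback(address)


# Run on each computer:
# print(describe_local_hostname())
```

If the resolved address turns out to be a loopback one on either computer, that may explain why the other machine is told to connect to 127.0.1.1.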