pajenk
(pa jank)
June 15, 2022, 5:28am
1
Hi.
I’m trying to use two computers following the example:
https://pytorch.org/tutorials/intermediate/rpc_async_execution.html#batch-updating-parameter-server
First I tried a very simple example, but I got an error.
On computer1 (let’s say, IP address: 1.1.1.1)
import os
import torch.distributed.rpc as rpc
os.environ['MASTER_ADDR'] = '2.2.2.2'
os.environ['MASTER_PORT'] = '7271'
rpc.init_rpc("worker1", rank=1, world_size=2)
rpc.shutdown()
On computer2 (let’s say, IP address: 2.2.2.2)
import os
import torch.distributed.rpc as rpc
os.environ['MASTER_ADDR'] = 'localhost'
os.environ['MASTER_PORT'] = '7271'
rpc.init_rpc("worker0", rank=0, world_size=2)
rpc.shutdown()
Error:
RuntimeError: […/third_party/gloo/gloo/transport/tcp/pair.cc:799] connect [127.0.1.1]:14237: Connection refused
This error looks the same as the error in the following post,
I am using the batch processing example for PS .
I put the PS and a worker on two machines. I changed os.environ['MASTER_ADDR'] = 'localhost' and
os.environ['MASTER_PORT'] = '29500' to point to the machine with rank 0.
The PS runs and waits for the worker, but the worker reports an error:
Traceback (most recent call last):
File "/home/skh018/PycharmProjects/test_DDL_pytorch/worker1.py", line 148, in
run(1, world_size)
File "/home/skh018/PycharmProjects/test_DDL_pytorch/worker1.py", line 124, in run
rpc.in…
but it seems that the problem has not been solved.
Do I need additional network settings, or extra options in the code?
=======================================
On both computer1 and computer2:
$ ifconfig
eno1: ~~~
lo: ~~~
# I did this.
$ sudo ufw allow 7271
H-Huang
(Howard Huang)
June 20, 2022, 8:05pm
2
Hi @pajenk, a couple of questions:
On computer2 (IP address: 2.2.2.2), have you also tried setting MASTER_ADDR to 2.2.2.2? e.g.
import os
import torch.distributed.rpc as rpc
os.environ['MASTER_ADDR'] = '2.2.2.2' # instead of localhost
os.environ['MASTER_PORT'] = '7271'
rpc.init_rpc("worker0", rank=0, world_size=2)
rpc.shutdown()
A Gloo error means the failure is likely happening during init_rpc. Which version of PyTorch are you using?
I'm not sure about additional networking settings, but the main thing is to confirm that the network address and port are reachable from each computer.
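One way to confirm reachability is a plain TCP connection attempt from the other machine. The following is only a minimal sketch: `can_connect` is a hypothetical helper (not part of PyTorch), and the address and port are the placeholder values used in this thread. It assumes the rank 0 process is already running and waiting in init_rpc, so that something is listening on MASTER_PORT.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds within timeout."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    # Placeholder values from this thread: rank 0 address and MASTER_PORT.
    print(can_connect("2.2.2.2", 7271))
```

Note that the error above mentions port 14237 rather than 7271, so a successful check of the master port alone may not cover every port involved in the connection.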
pajenk
(pa jank)
June 20, 2022, 8:59pm
3
Hi @H-Huang, thanks for the questions!
Yes, I tried that, but the same error occurred.
I tried '1.11.0+cu113' and '1.11.0+cu102' on both computers, each in a virtual environment (created with python3 -m venv ~).
Does the Python version need to be the same? The computers have different versions (3.8 and 3.9).
When I run the code on only one of the two computers, there is no error (it just waits). When I then run the other script on the other computer, the error occurs. So I think they are connecting to each other somehow.
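The address in the error, 127.0.1.1, is a loopback address, so one thing that may be worth checking on each machine is what its own hostname resolves to. The sketch below is only a diagnostic under that assumption: `is_loopback` and `resolved_hostname_address` are hypothetical helper names, and nothing in this thread confirms that hostname resolution is the actual cause.

```python
import ipaddress
import socket


def is_loopback(addr: str) -> bool:
    """Return True if addr is a loopback address (e.g. anything in 127.0.0.0/8)."""
    return ipaddress.ip_address(addr).is_loopback


def resolved_hostname_address() -> str:
    """Return the IPv4 address that this machine's hostname resolves to."""
    return socket.gethostbyname(socket.gethostname())


if __name__ == "__main__":
    addr = resolved_hostname_address()
    note = "loopback" if is_loopback(addr) else "not loopback"
    print(f"{socket.gethostname()} -> {addr} ({note})")
```

If a machine reports a loopback address here, one thing to try (again an assumption, not something verified in this thread) is setting the GLOO_SOCKET_IFNAME and TP_SOCKET_IFNAME environment variables to the real network interface, such as the eno1 interface listed by ifconfig, before calling init_rpc.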