I am currently in China, and I can use a VPN to establish an SSH connection to my server. However, I cannot get dist.init_process_group to work between my server and my computer to begin two-machine training.
On the client (my computer) I run:
However, the following error occurs on my client:
File "/Users/catbeta/opt/anaconda3/lib/python3.8/site-packages/torch/distributed/rendezvous.py", line 158, in _create_c10d_store
return TCPStore(
RuntimeError: Connection reset by peer
So, how can I set up the connection between my server and my laptop?
This seems to be a network connectivity issue, which can be complicated depending on whether your host IP and port are reachable from outside and whether your laptop has access to that IP. I would recommend doing the network connectivity diagnosis first.
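For example, a quick reachability check from the laptop can rule out basic TCP-level problems before involving PyTorch at all. A minimal sketch (the IP address and port below are placeholders; substitute your server's public address and the rendezvous port you pass to init_process_group):

```python
import socket

def can_connect(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a plain TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Placeholder address and port -- replace with your server's public IP
# and the port used in your init_method, e.g.:
# can_connect("203.0.113.1", 10000)
```

If this returns False from the laptop while SSH on port 22 works, the rendezvous port is blocked somewhere between the two machines (firewall, NAT, or the VPN only forwarding selected ports).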
What do you want to do with DDP training in a setup like this? The communication overhead over such a long physical distance could become the bottleneck of training.
Exactly, the bandwidth will be small. However, I want to measure how much bandwidth I need in a bad situation.
I checked my server and found that these ports are reachable from outside. But I just cannot connect using dist.init_process_group, while connecting over SSH works.
Could you give me some suggestions on how to inspect and diagnose the network environment in this case?
Thanks!
Are you able to ssh into the server you are running training on?
In addition, we have added significant logging improvements to the TCPStore to improve diagnostics for "connection reset by peer" type errors. Could you try out the PyTorch nightly build (Start Locally | PyTorch)? It should give much more comprehensive error messages that should help narrow down the problem.
Thanks for the suggestion!
I can SSH into my server using VS Code or an SSH terminal (through port 22).
I also installed PyTorch nightly on both my laptop and my server. However, I now get two different results.
The same error on my laptop:
Traceback (most recent call last):
File "test.py", line 46, in <module>
main_worker(1,0)
File "test.py", line 18, in main_worker
dist.init_process_group(backend='gloo', init_method='tcp://ip_addr(I hide this):10000',
File "/Applications/anaconda3/envs/3.7/nightly/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 595, in init_process_group
store, rank, world_size = next(rendezvous_iterator)
File "/Applications/anaconda3/envs/3.7/nightly/lib/python3.8/site-packages/torch/distributed/rendezvous.py", line 186, in _tcp_rendezvous_handler
store = _create_c10d_store(result.hostname, result.port, rank, world_size, timeout)
File "/Applications/anaconda3/envs/3.7/nightly/lib/python3.8/site-packages/torch/distributed/rendezvous.py", line 160, in _create_c10d_store
return TCPStore(
RuntimeError: Connection reset by peer
It hangs indefinitely on both the server and the laptop (until timeout).
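One way to rule out the script and the PyTorch install themselves is a loopback run on a single machine, where no firewall, NAT, or VPN is involved. A minimal sketch (the port number is arbitrary; this exercises the same TCPStore rendezvous path as the two-machine run, just with world_size=1):

```python
import torch
import torch.distributed as dist

def loopback_check(port: int = 29513) -> float:
    """Single-rank gloo init over TCP on loopback. If even this fails,
    the problem is local, not the wide-area network."""
    dist.init_process_group(
        backend="gloo",
        init_method=f"tcp://127.0.0.1:{port}",
        rank=0,
        world_size=1,
    )
    t = torch.ones(1)
    dist.all_reduce(t)  # trivially a no-op with a single rank
    dist.destroy_process_group()
    return t.item()

if __name__ == "__main__":
    print(loopback_check())
```

If this succeeds on both machines, the code is fine and the remaining problem is purely network reachability between the two hosts.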
Apart from the port passed in, Gloo listens on additional ephemeral ports that are communicated to the clients via the store. You really need to enable full connectivity between both nodes, not just selectively open a few ports.
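To separate the two failure modes (rendezvous port blocked vs. only the extra ephemeral ports blocked), you can test the store alone with torch.distributed.TCPStore. A sketch, with host and port as placeholders; in a real test you would run the master half on the server and the client half on the laptop with the server's public IP, but both halves are shown in one process on loopback here just to illustrate the API:

```python
from datetime import timedelta
from torch.distributed import TCPStore

HOST, PORT = "127.0.0.1", 10000  # placeholders: server IP + rendezvous port

# On the server: host the store on the rendezvous port.
# wait_for_workers=False makes the constructor return immediately.
server = TCPStore(HOST, PORT, world_size=2, is_master=True,
                  timeout=timedelta(seconds=30), wait_for_workers=False)

# On the laptop: connect to the same host and port.
client = TCPStore(HOST, PORT, world_size=2, is_master=False,
                  timeout=timedelta(seconds=30))

client.set("ping", "pong")
print(server.get("ping"))  # b'pong' -- the rendezvous port itself works
```

If this set/get succeeds across the real network but init_process_group with the gloo backend still hangs, the rendezvous port is fine and the blocked ephemeral ports are the culprit.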