I am currently in China, and I can use a VPN to establish an SSH connection to my server. However, I cannot get dist.init_process_group to work between my server and my computer to begin two-machine training.
On the client (my computer) I run:
However, the following error occurs on my client:
File "/Users/catbeta/opt/anaconda3/lib/python3.8/site-packages/torch/distributed/rendezvous.py", line 158, in _create_c10d_store
return TCPStore(
RuntimeError: Connection reset by peer
So, how can I set up the connection between my server and my laptop?
This seems to be a network connectivity issue, which can be complicated depending on whether your host IP and port are reachable from outside and whether your laptop has access to that IP. I would recommend doing the network connectivity diagnosis first.
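For example, a quick reachability check from the laptop can rule out basic TCP-level problems before involving PyTorch at all. A minimal sketch (the IP address and port below are placeholders; substitute your server's public address and the rendezvous port you pass to init_process_group):

```python
import socket

def can_connect(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a plain TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Placeholder address and port -- replace with your server's public IP
# and the port used in your init_method, e.g.:
# can_connect("203.0.113.1", 10000)
```

If this returns False from the laptop while SSH on port 22 works, the rendezvous port is blocked somewhere between the two machines (firewall, NAT, or the VPN only forwarding selected ports).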
What do you want to do with DDP training in a setup like this? The communication overhead over such a long physical distance could become the bottleneck of training.
Exactly, the bandwidth will be small. However, I want to measure how much bandwidth I need in a bad situation.
I checked my server and found that these ports are reachable from outside. But I just cannot connect using dist.init_process_group, while connecting over SSH works.
Could you give me some suggestions on how to inspect and diagnose the network environment in this case?
Thanks!
Are you able to ssh into the server you are running training on?
In addition, we have added significant logging improvements to the TCPStore to improve diagnostics for "connection reset by peer" type errors. Could you try out the PyTorch nightly build (Start Locally | PyTorch)? It should give much more comprehensive error messages that should help narrow down the problem.
Thanks for the suggestion!
I can SSH into my server using VS Code or an SSH terminal (through port 22).
I also installed PyTorch nightly on both my laptop and my server. However, I now get two different results.
The same error on my laptop:
Traceback (most recent call last):
File "test.py", line 46, in <module>
main_worker(1,0)
File "test.py", line 18, in main_worker
dist.init_process_group(backend='gloo', init_method='tcp://ip_addr(I hide this):10000',
File "/Applications/anaconda3/envs/3.7/nightly/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 595, in init_process_group
store, rank, world_size = next(rendezvous_iterator)
File "/Applications/anaconda3/envs/3.7/nightly/lib/python3.8/site-packages/torch/distributed/rendezvous.py", line 186, in _tcp_rendezvous_handler
store = _create_c10d_store(result.hostname, result.port, rank, world_size, timeout)
File "/Applications/anaconda3/envs/3.7/nightly/lib/python3.8/site-packages/torch/distributed/rendezvous.py", line 160, in _create_c10d_store
return TCPStore(
RuntimeError: Connection reset by peer
It hangs indefinitely on both the server and the laptop (until timeout).
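One way to rule out the script and the PyTorch install themselves is a loopback run on a single machine, where no firewall, NAT, or VPN is involved. A minimal sketch (the port number is arbitrary; this exercises the same TCPStore rendezvous path as the two-machine run, just with world_size=1):

```python
import torch
import torch.distributed as dist

def loopback_check(port: int = 29513) -> float:
    """Single-rank gloo init over TCP on loopback. If even this fails,
    the problem is local, not the wide-area network."""
    dist.init_process_group(
        backend="gloo",
        init_method=f"tcp://127.0.0.1:{port}",
        rank=0,
        world_size=1,
    )
    t = torch.ones(1)
    dist.all_reduce(t)  # trivially a no-op with a single rank
    dist.destroy_process_group()
    return t.item()

if __name__ == "__main__":
    print(loopback_check())
```

If this succeeds on both machines, the code is fine and the remaining problem is purely network reachability between the two hosts.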
Apart from the port passed in, Gloo listens on additional ephemeral ports that are communicated to the clients via the store. You really need to enable full connectivity between both nodes, not just selectively open a few ports.
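To separate the two failure modes (rendezvous port blocked vs. only the extra ephemeral ports blocked), you can test the store alone with torch.distributed.TCPStore. A sketch, with host and port as placeholders; in a real test you would run the master half on the server and the client half on the laptop with the server's public IP, but both halves are shown in one process on loopback here just to illustrate the API:

```python
from datetime import timedelta
from torch.distributed import TCPStore

HOST, PORT = "127.0.0.1", 10000  # placeholders: server IP + rendezvous port

# On the server: host the store on the rendezvous port.
# wait_for_workers=False makes the constructor return immediately.
server = TCPStore(HOST, PORT, world_size=2, is_master=True,
                  timeout=timedelta(seconds=30), wait_for_workers=False)

# On the laptop: connect to the same host and port.
client = TCPStore(HOST, PORT, world_size=2, is_master=False,
                  timeout=timedelta(seconds=30))

client.set("ping", "pong")
print(server.get("ping"))  # b'pong' -- the rendezvous port itself works
```

If this set/get succeeds across the real network but init_process_group with the gloo backend still hangs, the rendezvous port is fine and the blocked ephemeral ports are the culprit.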