[E1124 16:39:09.085084260 socket.cpp:1019] [c10d] The client socket has timed out after 60000ms while trying to connect to (<IP>, <PORT>).
[W1124 16:39:09.085392878 TCPStore.cpp:340] [c10d] TCP client failed to connect/validate to host .1:29500 - retrying (try=0, timeout=60000ms, delay=25938ms): The client socket has timed out after 60000ms while trying to connect to (<IP>, <PORT>).
(…)
torch.distributed.elastic.rendezvous.api.RendezvousConnectionError: The connection to the C10d store has failed. See inner exception for details.
Hello,
Here is a checklist for fixing the TCPStore timeout when running torchrun across two machines over a VPN. Your symptoms indicate that the rendezvous (TCPStore) port is reachable via netcat, but PyTorch cannot complete the rendezvous handshake. This almost always comes down to a configuration mistake, port binding, hostname resolution, or duplicated node-rank logic.
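
Before going through the checklist, it can help to confirm what the worker machine actually sees when it uses the same host string you give torchrun as the rendezvous endpoint. The following is a minimal sketch using only the Python standard library, not anything from PyTorch. The function name `probe_endpoint` and the address in the usage comment are hypothetical; port 29500 is taken from your log.

```python
import socket


def probe_endpoint(host: str, port: int, timeout: float = 5.0) -> dict:
    """Resolve `host`, then attempt a plain TCP connect to (host, port).

    Returns a dict with the resolved addresses, whether the connect
    succeeded, and the error text if resolution or the connect failed.
    """
    result = {"host": host, "port": port, "resolved": [],
              "connected": False, "error": None}
    try:
        # Show which addresses the name maps to on THIS machine. Over a VPN,
        # a hostname can resolve to a LAN address instead of the VPN one.
        infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
        result["resolved"] = sorted({info[4][0] for info in infos})
        # Plain TCP connect, roughly what netcat checks.
        with socket.create_connection((host, port), timeout=timeout):
            result["connected"] = True
    except OSError as exc:  # covers resolution errors, timeouts, refusals
        result["error"] = str(exc)
    return result


# Usage (hypothetical VPN address of the master node):
# print(probe_endpoint("10.8.0.1", 29500))
```

Run it on the worker node while the master is listening. If `resolved` lists an address that is not the master's VPN address, the hostname is the likely culprit. If `connected` is true and torchrun still times out, a successful connect only shows that something accepts connections on that port, so the problem is more likely in the rendezvous configuration than in the network path.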