Node 0 cannot connect to itself

gabrieldlm · November 24, 2025, 4:41pm

Hi,

I’m trying to run torchrun on 2 different machines over a VPN (the machines can communicate, as tested via netcat).

My run command is the following:

export TARGETIP=<IP>
export TARGETPORT=<PORT>
case $HOSTNAME in
    (HOSTNAME1)    export NODE=0 ;;
    (HOSTNAME1)    export NODE=1 ;;
    (*)            echo "ERROR - Unknown HOSTNAME: $HOSTNAME"; exit 1 ;;
esac
export OMP_NUM_THREADS=1
torchrun --nproc_per_node=8 --nnodes=2 --node_rank=$NODE --rdzv_backend=c10d \
         --rdzv_endpoint=$TARGETIP:$TARGETPORT -m train

However, on node 0 (the host) I get the following error:

[E1124 16:39:09.085084260 socket.cpp:1019] [c10d] The client socket has timed out after 60000ms while trying to connect to (<IP>, <PORT>).
[W1124 16:39:09.085392878 TCPStore.cpp:340] [c10d] TCP client failed to connect/validate to host .1:29500 - retrying (try=0, timeout=60000ms, delay=25938ms): The client socket has timed out after 60000ms while trying to connect to (<IP>, <PORT>).
(…)
torch.distributed.elastic.rendezvous.api.RendezvousConnectionError: The connection to the C10d store has failed. See inner exception for details.

How can I solve this?

firafaj934 · November 25, 2025, 6:17am

Hello,
Here are the things to check to fix the TCPStore timeout when using torchrun across two machines over a VPN. Your symptoms indicate that the rendezvous (TCPStore) port is reachable via netcat, but PyTorch cannot complete the rendezvous handshake. This usually comes down to configuration mistakes, port binding, hostname resolution, or duplicate node-rank logic. In the script you posted, both case branches match HOSTNAME1, so NODE=1 is never assigned.
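The node-rank point can be sketched as a small pre-flight script. This is only a sketch under assumptions: `node-a` and `node-b` are hypothetical hostnames, `node_rank_for` and `torchrun_cmd` are made-up helper names, and the torchrun flags are copied from the command in the question. It maps each hostname to a distinct rank and prints the launch command instead of running it, so the arguments can be compared on both machines first.

```shell
#!/usr/bin/env bash
# Pre-flight helpers for a two-node torchrun launch (illustrative sketch).
# node-a / node-b are hypothetical hostnames; substitute the real ones.

# Map a hostname to a node rank. Each hostname must appear exactly once,
# otherwise one rank is never assigned.
node_rank_for() {
    case "$1" in
        (node-a) echo 0 ;;
        (node-b) echo 1 ;;
        (*)      echo "ERROR - Unknown HOSTNAME: $1" >&2; return 1 ;;
    esac
}

# Print the torchrun command for a given rank ($1) and endpoint ($2)
# without executing it. The flags mirror the command from the question.
torchrun_cmd() {
    echo "torchrun --nproc_per_node=8 --nnodes=2 --node_rank=${1}" \
         "--rdzv_backend=c10d --rdzv_endpoint=${2} -m train"
}
```

On each machine you could then run `NODE=$(node_rank_for "$HOSTNAME") || exit 1` followed by `torchrun_cmd "$NODE" "$TARGETIP:$TARGETPORT"`, and confirm that the two printed commands differ only in `--node_rank`. If node 0 still tries to connect as a client, it may be worth checking the c10d rendezvous documentation for your PyTorch version: as far as I know, the backend decides which node hosts the store by matching the endpoint against the local hostname and IP addresses, and an `is_host` option can be passed through `--rdzv_conf` to override that.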

Best Regards

fduwjj · December 1, 2025, 10:24pm

I roughly remember that you need to use Slurm to run the command across hosts with torchrun.
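If the machines are managed by Slurm, the hostname-based case statement could be replaced by the node index that Slurm exports. The following is a minimal sketch, assuming the script is started through `srun` so that `SLURM_NODEID` is set; `slurm_node_rank` is a hypothetical helper name, and nothing in this thread confirms that the two machines are part of a Slurm cluster.

```shell
#!/usr/bin/env bash
# Derive the torchrun node rank from Slurm (illustrative sketch).
# Assumes the script is launched via srun, which sets SLURM_NODEID to the
# relative index of the node within the job allocation (0, 1, ...).
slurm_node_rank() {
    if [ -z "${SLURM_NODEID:-}" ]; then
        echo "ERROR - SLURM_NODEID is not set; not running under srun?" >&2
        return 1
    fi
    echo "$SLURM_NODEID"
}
```

The result can be passed as `--node_rank=$(slurm_node_rank)`, which avoids keeping a per-host rank table in sync by hand.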