I am trying to train a model using Distributed Data Parallel (DDP) across multiple nodes. However, the training process hangs at the TCPStore initialization inside the next_rendezvous method of static_tcp_rendezvous.py (line 55), and nothing gets printed after that. It seems like the content of my training script (train.py) never gets executed when using multiple nodes.
When I run the same script with DDP on a single node, the training runs successfully without any issues.
Command I use to run the script:
torchrun --nproc_per_node=1 --nnodes=2 --node_rank=0 --master_addr="192.168.3.23" --master_port=1234 multinode.py
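For reference, the corresponding launch on the second node (only --node_rank changes, everything else identical) would be:
torchrun --nproc_per_node=1 --nnodes=2 --node_rank=1 --master_addr="192.168.3.23" --master_port=1234 multinode.py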
PyTorch version: 2.4.1
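To isolate whether the rendezvous itself is the problem, here is a minimal standalone check of just the TCPStore connection, independent of the training script (my own sketch, reusing the master address and port from the command above):

# Minimal TCPStore connectivity check (sketch). Run with RANK=0 on the master
# node first, then RANK=1 on the second node. If this also hangs, the issue is
# reachability of 192.168.3.23:1234 between the nodes, not the training code.
import os
from datetime import timedelta
from torch.distributed import TCPStore

rank = int(os.environ.get("RANK", "0"))  # set RANK=0 on the master node, RANK=1 on the worker
store = TCPStore(
    "192.168.3.23",
    1234,
    world_size=2,
    is_master=(rank == 0),
    timeout=timedelta(seconds=30),
)
store.set(f"hello_from_{rank}", "ok")
print("TCPStore connected, rank", rank)

Note that the master-side constructor waits for all workers to connect by default, so both nodes need to run the snippet for it to return.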