I am trying to train a model using Distributed Data Parallel (DDP) across multiple nodes. However, the training process hangs at the TCPStore initialization inside the next_rendezvous method of static_tcp_rendezvous.py (line 55), and nothing gets printed after that. It seems like the content of my training script (train.py) never gets executed when using multiple nodes.
When I run the same script with DDP on a single node, the training runs successfully without any issues.
Command I use to run the script:
torchrun --nproc_per_node=1 --nnodes=2 --node_rank=0 --master_addr="192.168.3.23" --master_port=1234 multinode.py
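For reference, the corresponding launch on the second node (only --node_rank changes, everything else identical) would be:
torchrun --nproc_per_node=1 --nnodes=2 --node_rank=1 --master_addr="192.168.3.23" --master_port=1234 multinode.py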
PyTorch version: 2.4.1
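To isolate whether the rendezvous itself is the problem, here is a minimal standalone check of just the TCPStore connection, independent of the training script (my own sketch, reusing the master address and port from the command above):

# Minimal TCPStore connectivity check (sketch). Run with RANK=0 on the master
# node first, then RANK=1 on the second node. If this also hangs, the issue is
# reachability of 192.168.3.23:1234 between the nodes, not the training code.
import os
from datetime import timedelta
from torch.distributed import TCPStore

rank = int(os.environ.get("RANK", "0"))  # set RANK=0 on the master node, RANK=1 on the worker
store = TCPStore(
    "192.168.3.23",
    1234,
    world_size=2,
    is_master=(rank == 0),
    timeout=timedelta(seconds=30),
)
store.set(f"hello_from_{rank}", "ok")
print("TCPStore connected, rank", rank)

Note that the master-side constructor waits for all workers to connect by default, so both nodes need to run the snippet for it to return.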