I’m running multi-node Distributed Data Parallel (DDP) training with torchrun across two servers, each with a single GPU. The model I’m training is YOLOv9, and the torchrun
command I use is:
torchrun \
--nnodes=2 \
--nproc-per-node=1 \
--max-restarts=3 \
--rdzv-id=123 \
--rdzv-backend=c10d \
--rdzv-endpoint=10.40.0.10:1234 \
train_dual.py \
--workers 4 \
--device 0 \
--batch 8 \
--data ../widerface.yaml \
--img 640 \
--cfg models/detect/yolov9-c.yaml \
--weights '' \
--name yolov9-c \
--hyp hyp.scratch-high.yaml \
--min-items 0 \
--epochs 5 \
--close-mosaic 15
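For context, the distributed setup inside train_dual.py follows the usual torchrun/DDP pattern. The sketch below is my simplified understanding of that pattern, not the verbatim YOLOv9 source; a tiny nn.Linear stands in for the model, and LOCAL_RANK comes from the environment torchrun sets on each rank.

# Simplified sketch of the DDP setup pattern train_dual.py relies on
# (not the actual YOLOv9 code; nn.Linear is a stand-in for the model).
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

local_rank = int(os.environ["LOCAL_RANK"])  # set by torchrun on every rank
torch.cuda.set_device(local_rank)

# torchrun's c10d rendezvous exports MASTER_ADDR/MASTER_PORT/RANK/WORLD_SIZE,
# so the default env:// initialization is sufficient here.
dist.init_process_group(backend="nccl")

model = DDP(nn.Linear(10, 10).cuda(local_rank), device_ids=[local_rank])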
Training proceeds successfully on both nodes, but after training completes on the master node, the worker node raises the following errors:
- Warning:
[rank1]:[W1231 04:12:49.342268801 ProcessGroupNCCL.cpp:1250] Warning: WARNING:
process group has NOT been destroyed before we destruct ProcessGroupNCCL. ...
- TCPStore Error:
[W1231 04:13:16.124330756 TCPStore.cpp:122] [c10d] sendBytes failed on SocketImpl(fd=3, addr=[toss-pod02]:49540, remote=[::ffff:172.17.154.66]:1234): Broken pipe
Full error output:
[rank1]:[W1231 04:12:49.342268801 ProcessGroupNCCL.cpp:1250] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present, but this warning has only been added since PyTorch 2.4 (function operator())
[W1231 04:13:16.117080767 TCPStore.cpp:131] [c10d] recvVector failed on SocketImpl(fd=3, addr=[toss-pod02]:49540, remote=[::ffff:172.17.154.66]:1234): failed to recv, got 0 bytes
Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:670 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7fd62dced446 in /usr/local/lib64/python3.11/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x5fec818 (0x7fd668d2f818 in /usr/local/lib64/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #2: <unknown function> + 0x5fece49 (0x7fd668d2fe49 in /usr/local/lib64/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #3: <unknown function> + 0x5fefd67 (0x7fd668d32d67 in /usr/local/lib64/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #4: c10d::TCPStore::compareSet(std::string const&, std::vector<unsigned char, std::allocator<unsigned char> > const&, std::vector<unsigned char, std::allocator<unsigned char> > const&) + 0x254 (0x7fd668d2c5e4 in /usr/local/lib64/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #5: <unknown function> + 0xd7c7a4 (0x7fd6787247a4 in /usr/local/lib64/python3.11/site-packages/torch/lib/libtorch_python.so)
frame #6: <unknown function> + 0x4ccad4 (0x7fd677e74ad4 in /usr/local/lib64/python3.11/site-packages/torch/lib/libtorch_python.so)
<omitting python frames>
frame #30: <unknown function> + 0x295d0 (0x7fd67a1895d0 in /lib64/libc.so.6)
frame #31: __libc_start_main + 0x80 (0x7fd67a189680 in /lib64/libc.so.6)
frame #32: _start + 0x25 (0x560505115095 in /usr/bin/python3.11)
W1231 04:13:16.008000 300 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1282] The node 'toss-pod02_300_0' has failed to shutdown the rendezvous '123' due to an error of type RendezvousConnectionError.
[W1231 04:13:16.124330756 TCPStore.cpp:122] [c10d] sendBytes failed on SocketImpl(fd=3, addr=[toss-pod02]:49540, remote=[::ffff:172.17.154.66]:1234): Broken pipe
Exception raised from sendBytes at ../torch/csrc/distributed/c10d/Utils.hpp:645 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7fd62dced446 in /usr/local/lib64/python3.11/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x5fecb29 (0x7fd668d2fb29 in /usr/local/lib64/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #2: c10d::TCPStore::compareSet(std::string const&, std::vector<unsigned char, std::allocator<unsigned char> > const&, std::vector<unsigned char, std::allocator<unsigned char> > const&) + 0x22d (0x7fd668d2c5bd in /usr/local/lib64/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #3: <unknown function> + 0xd7c7a4 (0x7fd6787247a4 in /usr/local/lib64/python3.11/site-packages/torch/lib/libtorch_python.so)
frame #4: <unknown function> + 0x4ccad4 (0x7fd677e74ad4 in /usr/local/lib64/python3.11/site-packages/torch/lib/libtorch_python.so)
<omitting python frames>
frame #26: <unknown function> + 0x295d0 (0x7fd67a1895d0 in /lib64/libc.so.6)
frame #27: __libc_start_main + 0x80 (0x7fd67a189680 in /lib64/libc.so.6)
frame #28: _start + 0x25 (0x560505115095 in /usr/bin/python3.11)
W1231 04:13:16.013000 300 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1282] The node 'toss-pod02_300_0' has failed to shutdown the rendezvous '123' due to an error of type RendezvousConnectionError.
To address this, I added the following code at the end of my training script to synchronize and clean up the process group:
# End training
import torch.distributed as dist
# Synchronize processes
dist.barrier()
# Destroy process group
dist.destroy_process_group()
However, the same error persists.
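To help isolate whether the teardown itself is the problem, I put together the minimal script below, which exercises only the init/barrier/destroy path with the same torchrun command (substituting it for train_dual.py and dropping the YOLOv9 arguments). The file name minimal_teardown.py is just one I made up, and I have not yet confirmed whether it reproduces the same warnings; it is only a sketch of the stripped-down case.

# minimal_teardown.py - exercises only the c10d init/barrier/destroy path.
# Launch it with the same torchrun command, substituting it for train_dual.py.
import os
import torch
import torch.distributed as dist

local_rank = int(os.environ["LOCAL_RANK"])   # set by torchrun on each rank
torch.cuda.set_device(local_rank)
dist.init_process_group(backend="nccl")      # env:// defaults filled in by torchrun

rank = dist.get_rank()
print(f"rank {rank}: process group initialized")

dist.barrier()                               # make sure every rank reaches this point
dist.destroy_process_group()                 # the explicit teardown the warning asks for
print(f"rank {rank}: process group destroyed")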
Request for Help:
- How can I resolve this issue to ensure the process group is properly destroyed after training?
- Are there additional steps or configurations I should consider for multi-node DDP training?