I’m running multi-node Distributed Data Parallel (DDP) training with torchrun across two servers, each with a single GPU. The model I’m training is YOLOv9, and the torchrun
command I use is:
torchrun \
--nnodes=2 \
--nproc-per-node=1 \
--max-restarts=3 \
--rdzv-id=123 \
--rdzv-backend=c10d \
--rdzv-endpoint=10.40.0.10:1234 \
train_dual.py \
--workers 4 \
--device 0 \
--batch 8 \
--data ../widerface.yaml \
--img 640 \
--cfg models/detect/yolov9-c.yaml \
--weights '' \
--name yolov9-c \
--hyp hyp.scratch-high.yaml \
--min-items 0 \
--epochs 5 \
--close-mosaic 15
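For context, the distributed setup inside train_dual.py follows the usual torchrun/DDP pattern. The sketch below is my simplified understanding of that pattern, not the verbatim YOLOv9 source; a tiny nn.Linear stands in for the model, and LOCAL_RANK comes from the environment torchrun sets on each rank.

# Simplified sketch of the DDP setup pattern train_dual.py relies on
# (not the actual YOLOv9 code; nn.Linear is a stand-in for the model).
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

local_rank = int(os.environ["LOCAL_RANK"])  # set by torchrun on every rank
torch.cuda.set_device(local_rank)

# torchrun's c10d rendezvous exports MASTER_ADDR/MASTER_PORT/RANK/WORLD_SIZE,
# so the default env:// initialization is sufficient here.
dist.init_process_group(backend="nccl")

model = DDP(nn.Linear(10, 10).cuda(local_rank), device_ids=[local_rank])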
Training proceeds successfully on both nodes, but after training completes on the master node, the worker node raises the following errors:
- Warning:
[rank1]:[W1231 04:12:49.342268801 ProcessGroupNCCL.cpp:1250] Warning: WARNING:
process group has NOT been destroyed before we destruct ProcessGroupNCCL. ...
- TCPStore Error:
[W1231 04:13:16.124330756 TCPStore.cpp:122] [c10d] sendBytes failed on SocketImpl(fd=3, addr=[toss-pod02]:49540, remote=[::ffff:172.17.154.66]:1234): Broken pipe
Full error output:
[rank1]:[W1231 04:12:49.342268801 ProcessGroupNCCL.cpp:1250] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present, but this warning has only been added since PyTorch 2.4 (function operator())
[W1231 04:13:16.117080767 TCPStore.cpp:131] [c10d] recvVector failed on SocketImpl(fd=3, addr=[toss-pod02]:49540, remote=[::ffff:172.17.154.66]:1234): failed to recv, got 0 bytes
Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:670 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7fd62dced446 in /usr/local/lib64/python3.11/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x5fec818 (0x7fd668d2f818 in /usr/local/lib64/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #2: <unknown function> + 0x5fece49 (0x7fd668d2fe49 in /usr/local/lib64/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #3: <unknown function> + 0x5fefd67 (0x7fd668d32d67 in /usr/local/lib64/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #4: c10d::TCPStore::compareSet(std::string const&, std::vector<unsigned char, std::allocator<unsigned char> > const&, std::vector<unsigned char, std::allocator<unsigned char> > const&) + 0x254 (0x7fd668d2c5e4 in /usr/local/lib64/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #5: <unknown function> + 0xd7c7a4 (0x7fd6787247a4 in /usr/local/lib64/python3.11/site-packages/torch/lib/libtorch_python.so)
frame #6: <unknown function> + 0x4ccad4 (0x7fd677e74ad4 in /usr/local/lib64/python3.11/site-packages/torch/lib/libtorch_python.so)
<omitting python frames>
frame #30: <unknown function> + 0x295d0 (0x7fd67a1895d0 in /lib64/libc.so.6)
frame #31: __libc_start_main + 0x80 (0x7fd67a189680 in /lib64/libc.so.6)
frame #32: _start + 0x25 (0x560505115095 in /usr/bin/python3.11)
W1231 04:13:16.008000 300 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1282] The node 'toss-pod02_300_0' has failed to shutdown the rendezvous '123' due to an error of type RendezvousConnectionError.
[W1231 04:13:16.124330756 TCPStore.cpp:122] [c10d] sendBytes failed on SocketImpl(fd=3, addr=[toss-pod02]:49540, remote=[::ffff:172.17.154.66]:1234): Broken pipe
Exception raised from sendBytes at ../torch/csrc/distributed/c10d/Utils.hpp:645 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7fd62dced446 in /usr/local/lib64/python3.11/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x5fecb29 (0x7fd668d2fb29 in /usr/local/lib64/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #2: c10d::TCPStore::compareSet(std::string const&, std::vector<unsigned char, std::allocator<unsigned char> > const&, std::vector<unsigned char, std::allocator<unsigned char> > const&) + 0x22d (0x7fd668d2c5bd in /usr/local/lib64/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #3: <unknown function> + 0xd7c7a4 (0x7fd6787247a4 in /usr/local/lib64/python3.11/site-packages/torch/lib/libtorch_python.so)
frame #4: <unknown function> + 0x4ccad4 (0x7fd677e74ad4 in /usr/local/lib64/python3.11/site-packages/torch/lib/libtorch_python.so)
<omitting python frames>
frame #26: <unknown function> + 0x295d0 (0x7fd67a1895d0 in /lib64/libc.so.6)
frame #27: __libc_start_main + 0x80 (0x7fd67a189680 in /lib64/libc.so.6)
frame #28: _start + 0x25 (0x560505115095 in /usr/bin/python3.11)
W1231 04:13:16.013000 300 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1282] The node 'toss-pod02_300_0' has failed to shutdown the rendezvous '123' due to an error of type RendezvousConnectionError.
To address this, I added the following code at the end of my training script to synchronize and clean up the process group:
# End training
import torch.distributed as dist
# Synchronize processes
dist.barrier()
# Destroy process group
dist.destroy_process_group()
However, the same error persists.
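To help isolate whether the teardown itself is the problem, I put together the minimal script below, which exercises only the init/barrier/destroy path with the same torchrun command (substituting it for train_dual.py and dropping the YOLOv9 arguments). The file name minimal_teardown.py is just one I made up, and I have not yet confirmed whether it reproduces the same warnings; it is only a sketch of the stripped-down case.

# minimal_teardown.py - exercises only the c10d init/barrier/destroy path.
# Launch it with the same torchrun command, substituting it for train_dual.py.
import os
import torch
import torch.distributed as dist

local_rank = int(os.environ["LOCAL_RANK"])   # set by torchrun on each rank
torch.cuda.set_device(local_rank)
dist.init_process_group(backend="nccl")      # env:// defaults filled in by torchrun

rank = dist.get_rank()
print(f"rank {rank}: process group initialized")

dist.barrier()                               # make sure every rank reaches this point
dist.destroy_process_group()                 # the explicit teardown the warning asks for
print(f"rank {rank}: process group destroyed")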
Request for Help:
- How can I resolve this issue to ensure the process group is properly destroyed after training?
- Are there additional steps or configurations I should consider for multi-node DDP training?