Training hangs on loss.backward() with DDP --nnodes=2 --nproc_per_node=3

Hi all,

I am stuck on a hanging issue when using DDP. After a few tests, I found that the processes stop at loss.backward() in the first iteration. The interesting point is that this only happens when running across 2 nodes with 3 or more GPUs per node.
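For context, the training step follows the standard DDP pattern; below is a simplified sketch of where the hang occurs (build_model, compute_loss, loader, and optimizer are placeholders, not the actual glue-factory code):

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Simplified sketch; build_model, compute_loss, loader, and optimizer are placeholders.
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)
model = DDP(build_model().to(local_rank), device_ids=[local_rank])

for data in loader:
    loss = compute_loss(model(data), data)
    loss.backward()   # hangs here in the first iteration (gradient all-reduce over NCCL)
    optimizer.step()
    optimizer.zero_grad()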

I have two machines on the same local network. Each machine has 8 GPUs and CUDA installed.
Machine 1:

$ nvidia-smi
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 555.42.06              Driver Version: 555.42.06      CUDA Version: 12.5     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA RTX A5000               Off |   00000000:4F:00.0 Off |                  Off |
| 33%   50C    P8             22W /  230W |      18MiB /  24564MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA RTX A5000               Off |   00000000:52:00.0 Off |                  Off |
| 39%   55C    P8             17W /  230W |      18MiB /  24564MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA RTX A5000               Off |   00000000:56:00.0 Off |                  Off |
| 30%   27C    P8             18W /  230W |      18MiB /  24564MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA RTX A5000               Off |   00000000:57:00.0 Off |                  Off |
| 30%   28C    P8             20W /  230W |      18MiB /  24564MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   4  NVIDIA RTX A5000               Off |   00000000:CE:00.0 Off |                  Off |
| 30%   27C    P8             20W /  230W |      18MiB /  24564MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   5  NVIDIA RTX A5000               Off |   00000000:D1:00.0 Off |                  Off |
| 30%   27C    P8             20W /  230W |      18MiB /  24564MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   6  NVIDIA RTX A5000               Off |   00000000:D5:00.0 Off |                  Off |
| 30%   29C    P8             16W /  230W |      18MiB /  24564MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   7  NVIDIA RTX A5000               Off |   00000000:D6:00.0 Off |                  Off |
| 30%   29C    P8             17W /  230W |      18MiB /  24564MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

Machine 2:

$ nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17   Driver Version: 525.105.17   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A40          On   | 00000000:01:00.0 Off |                    0 |
|  0%   43C    P8    23W / 300W |      0MiB / 46068MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A40          On   | 00000000:25:00.0 Off |                    0 |
|  0%   47C    P8    22W / 300W |      0MiB / 46068MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA A40          On   | 00000000:41:00.0 Off |                    0 |
|  0%   27C    P8    23W / 300W |      0MiB / 46068MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA A40          On   | 00000000:61:00.0 Off |                    0 |
|  0%   27C    P8    23W / 300W |      0MiB / 46068MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   4  NVIDIA A40          On   | 00000000:81:00.0 Off |                    0 |
|  0%   33C    P8    31W / 300W |      0MiB / 46068MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   5  NVIDIA A40          On   | 00000000:A1:00.0 Off |                    0 |
|  0%   31C    P8    24W / 300W |      0MiB / 46068MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   6  NVIDIA A40          On   | 00000000:C1:00.0 Off |                    0 |
|  0%   30C    P8    23W / 300W |      0MiB / 46068MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   7  NVIDIA A40          On   | 00000000:E1:00.0 Off |                    0 |
|  0%   28C    P8    23W / 300W |      0MiB / 46068MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

This command, where each GPU takes 16 of the 128 samples per mini-batch, works well:
NCCL_SOCKET_IFNAME=eno1 NCCL_P2P_DISABLE=1 NO_ALBUMENTATIONS_UPDATE=1 torchrun --nnodes=1 --nproc_per_node=8 --rdzv_id=100 --rdzv_backend=c10d --rdzv_endpoint=host_node_addr:29400 -m gluefactory.train batch_size=128

This command, where each GPU takes 16 samples per mini-batch, also works fine and confirms that the two nodes can communicate:
NCCL_SOCKET_IFNAME=eno1 NCCL_P2P_DISABLE=1 NO_ALBUMENTATIONS_UPDATE=1 torchrun --nnodes=2 --nproc_per_node=2 --rdzv_id=100 --rdzv_backend=c10d --rdzv_endpoint=host_node_addr:29400 -m gluefactory.train batch_size=64

However, the command fails whenever --nproc_per_node > 2; here each GPU takes 1 sample per mini-batch:
NCCL_SOCKET_IFNAME=eno1 NCCL_P2P_DISABLE=1 NO_ALBUMENTATIONS_UPDATE=1 torchrun --nnodes=2 --nproc_per_node=3 --rdzv_id=100 --rdzv_backend=c10d --rdzv_endpoint=host_node_addr:29400 -m gluefactory.train batch_size=6
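Just to make the per-GPU batch arithmetic explicit for the three runs above:

# World size and per-GPU batch for the three runs (nnodes x nproc_per_node);
# only the last configuration hangs.
for nnodes, nproc, batch in [(1, 8, 128), (2, 2, 64), (2, 3, 6)]:
    world_size = nnodes * nproc
    print(f"{nnodes} node(s) x {nproc} proc(s): world_size={world_size}, per-GPU batch={batch // world_size}")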

The error log is below; it shows that the processes timed out on an ALLREDUCE operation, i.e., the gradient all-reduce triggered by loss.backward():

[rank0]:[E517 19:25:06.209264982 ProcessGroupNCCL.cpp:616] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=24, OpType=ALLREDUCE, NumelIn=512, NumelOut=512, Timeout(ms)=600000) ran for 600060 milliseconds before timing out.
[rank0]:[E517 19:25:06.209801226 ProcessGroupNCCL.cpp:1785] [PG ID 0 PG GUID 0(default_pg) Rank 0] Exception (either an error or timeout) detected by watchdog at work: 24, last enqueued NCCL work: 24, last completed NCCL work: 23.
[rank4]:[E517 19:25:06.247265883 ProcessGroupNCCL.cpp:616] [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=24, OpType=ALLREDUCE, NumelIn=512, NumelOut=512, Timeout(ms)=600000) ran for 600098 milliseconds before timing out.
[rank4]:[E517 19:25:06.247685696 ProcessGroupNCCL.cpp:1785] [PG ID 0 PG GUID 0(default_pg) Rank 4] Exception (either an error or timeout) detected by watchdog at work: 24, last enqueued NCCL work: 24, last completed NCCL work: 23.
[rank7]:[E517 19:25:06.292129309 ProcessGroupNCCL.cpp:616] [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=24, OpType=ALLREDUCE, NumelIn=512, NumelOut=512, Timeout(ms)=600000) ran for 600015 milliseconds before timing out.
[rank7]:[E517 19:25:06.292596487 ProcessGroupNCCL.cpp:1785] [PG ID 0 PG GUID 0(default_pg) Rank 7] Exception (either an error or timeout) detected by watchdog at work: 24, last enqueued NCCL work: 24, last completed NCCL work: 23.
[rank5]:[E517 19:25:06.310787916 ProcessGroupNCCL.cpp:616] [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=24, OpType=ALLREDUCE, NumelIn=512, NumelOut=512, Timeout(ms)=600000) ran for 600037 milliseconds before timing out.
[rank5]:[E517 19:25:06.311238626 ProcessGroupNCCL.cpp:1785] [PG ID 0 PG GUID 0(default_pg) Rank 5] Exception (either an error or timeout) detected by watchdog at work: 24, last enqueued NCCL work: 24, last completed NCCL work: 23.
[rank6]:[E517 19:25:06.348387063 ProcessGroupNCCL.cpp:616] [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=24, OpType=ALLREDUCE, NumelIn=512, NumelOut=512, Timeout(ms)=600000) ran for 600016 milliseconds before timing out.
[rank6]:[E517 19:25:06.348858713 ProcessGroupNCCL.cpp:1785] [PG ID 0 PG GUID 0(default_pg) Rank 6] Exception (either an error or timeout) detected by watchdog at work: 24, last enqueued NCCL work: 24, last completed NCCL work: 23.
[rank2]:[E517 19:25:06.350021754 ProcessGroupNCCL.cpp:616] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=24, OpType=ALLREDUCE, NumelIn=512, NumelOut=512, Timeout(ms)=600000) ran for 600006 milliseconds before timing out.
[rank2]:[E517 19:25:06.350501739 ProcessGroupNCCL.cpp:1785] [PG ID 0 PG GUID 0(default_pg) Rank 2] Exception (either an error or timeout) detected by watchdog at work: 24, last enqueued NCCL work: 24, last completed NCCL work: 23.
[rank3]:[E517 19:25:06.356503284 ProcessGroupNCCL.cpp:616] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=24, OpType=ALLREDUCE, NumelIn=512, NumelOut=512, Timeout(ms)=600000) ran for 600079 milliseconds before timing out.
[rank3]:[E517 19:25:06.356935837 ProcessGroupNCCL.cpp:1785] [PG ID 0 PG GUID 0(default_pg) Rank 3] Exception (either an error or timeout) detected by watchdog at work: 24, last enqueued NCCL work: 24, last completed NCCL work: 23.
[rank1]:[E517 19:25:06.358857753 ProcessGroupNCCL.cpp:616] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=24, OpType=ALLREDUCE, NumelIn=512, NumelOut=512, Timeout(ms)=600000) ran for 600014 milliseconds before timing out.
[rank1]:[E517 19:25:06.359291018 ProcessGroupNCCL.cpp:1785] [PG ID 0 PG GUID 0(default_pg) Rank 1] Exception (either an error or timeout) detected by watchdog at work: 24, last enqueued NCCL work: 24, last completed NCCL work: 23.
[rank6]:[E517 19:25:07.646662561 ProcessGroupNCCL.cpp:1834] [PG ID 0 PG GUID 0(default_pg) Rank 6] Timeout at NCCL work: 24, last enqueued NCCL work: 24, last completed NCCL work: 23.
[rank6]:[E517 19:25:07.646680342 ProcessGroupNCCL.cpp:630] [Rank 6] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank6]:[E517 19:25:07.646684725 ProcessGroupNCCL.cpp:636] [Rank 6] To avoid data inconsistency, we are taking the entire process down.
[rank4]:[E517 19:25:07.647492384 ProcessGroupNCCL.cpp:1834] [PG ID 0 PG GUID 0(default_pg) Rank 4] Timeout at NCCL work: 24, last enqueued NCCL work: 24, last completed NCCL work: 23.
[rank4]:[E517 19:25:07.647508092 ProcessGroupNCCL.cpp:630] [Rank 4] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank4]:[E517 19:25:07.647512556 ProcessGroupNCCL.cpp:636] [Rank 4] To avoid data inconsistency, we are taking the entire process down.
[rank3]:[E517 19:25:07.647509443 ProcessGroupNCCL.cpp:1834] [PG ID 0 PG GUID 0(default_pg) Rank 3] Timeout at NCCL work: 24, last enqueued NCCL work: 24, last completed NCCL work: 23.
[rank3]:[E517 19:25:07.647525117 ProcessGroupNCCL.cpp:630] [Rank 3] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank3]:[E517 19:25:07.647531200 ProcessGroupNCCL.cpp:636] [Rank 3] To avoid data inconsistency, we are taking the entire process down.
[rank5]:[E517 19:25:07.647848914 ProcessGroupNCCL.cpp:1834] [PG ID 0 PG GUID 0(default_pg) Rank 5] Timeout at NCCL work: 24, last enqueued NCCL work: 24, last completed NCCL work: 23.
[rank5]:[E517 19:25:07.647865370 ProcessGroupNCCL.cpp:630] [Rank 5] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank5]:[E517 19:25:07.647871322 ProcessGroupNCCL.cpp:636] [Rank 5] To avoid data inconsistency, we are taking the entire process down.
[rank6]:[E517 19:25:07.647926445 ProcessGroupNCCL.cpp:1595] [PG ID 0 PG GUID 0(default_pg) Rank 6] Process group watchdog thread terminated with exception: [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=24, OpType=ALLREDUCE, NumelIn=512, NumelOut=512, Timeout(ms)=600000) ran for 600016 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f859a12e446 in /miniconda3/envs/glue-factory/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x282 (0x7f859b441772 in /miniconda3/envs/glue-factory/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7f859b448bb3 in /miniconda3/envs/glue-factory/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7f859b44a61d in /miniconda3/envs/glue-factory/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0x145c0 (0x7f85e3dd75c0 in /miniconda3/envs/glue-factory/lib/python3.10/site-packages/torch/lib/libtorch.so)
frame #5: <unknown function> + 0x8609 (0x7f85e6725609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x43 (0x7f85e64ee353 in /lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'
  what():  [PG ID 0 PG GUID 0(default_pg) Rank 6] Process group watchdog thread terminated with exception: [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=24, OpType=ALLREDUCE, NumelIn=512, NumelOut=512, Timeout(ms)=600000) ran for 600016 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f859a12e446 in /miniconda3/envs/glue-factory/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x282 (0x7f859b441772 in /miniconda3/envs/glue-factory/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7f859b448bb3 in /miniconda3/envs/glue-factory/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7f859b44a61d in /miniconda3/envs/glue-factory/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0x145c0 (0x7f85e3dd75c0 in /miniconda3/envs/glue-factory/lib/python3.10/site-packages/torch/lib/libtorch.so)
frame #5: <unknown function> + 0x8609 (0x7f85e6725609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x43 (0x7f85e64ee353 in /lib/x86_64-linux-gnu/libc.so.6)

Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f859a12e446 in /miniconda3/envs/glue-factory/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xe4271b (0x7f859b0b771b in /miniconda3/envs/glue-factory/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0x145c0 (0x7f85e3dd75c0 in /miniconda3/envs/glue-factory/lib/python3.10/site-packages/torch/lib/libtorch.so)
frame #3: <unknown function> + 0x8609 (0x7f85e6725609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #4: clone + 0x43 (0x7f85e64ee353 in /lib/x86_64-linux-gnu/libc.so.6)

[rank4]:[E517 19:25:07.648765999 ProcessGroupNCCL.cpp:1595] [PG ID 0 PG GUID 0(default_pg) Rank 4] Process group watchdog thread terminated with exception: [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=24, OpType=ALLREDUCE, NumelIn=512, NumelOut=512, Timeout(ms)=600000) ran for 600098 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f5185aab446 in /miniconda3/envs/glue-factory/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x282 (0x7f5186dbe772 in /miniconda3/envs/glue-factory/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7f5186dc5bb3 in /miniconda3/envs/glue-factory/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7f5186dc761d in /miniconda3/envs/glue-factory/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0x145c0 (0x7f51cf7545c0 in /miniconda3/envs/glue-factory/lib/python3.10/site-packages/torch/lib/libtorch.so)
frame #5: <unknown function> + 0x8609 (0x7f51d20a2609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x43 (0x7f51d1e6b353 in /lib/x86_64-linux-gnu/libc.so.6)

[rank3]:[E517 19:25:07.648766792 ProcessGroupNCCL.cpp:1595] [PG ID 0 PG GUID 0(default_pg) Rank 3] Process group watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=24, OpType=ALLREDUCE, NumelIn=512, NumelOut=512, Timeout(ms)=600000) ran for 600079 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7fd004ff2446 in /miniconda3/envs/glue-factory/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x282 (0x7fd006305772 in /miniconda3/envs/glue-factory/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7fd00630cbb3 in /miniconda3/envs/glue-factory/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7fd00630e61d in /miniconda3/envs/glue-factory/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0x145c0 (0x7fd04ec9b5c0 in /miniconda3/envs/glue-factory/lib/python3.10/site-packages/torch/lib/libtorch.so)
frame #5: <unknown function> + 0x8609 (0x7fd0515e9609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x43 (0x7fd0513b2353 in /lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'
terminate called after throwing an instance of 'c10::DistBackendError'
[rank5]:[E517 19:25:07.649093588 ProcessGroupNCCL.cpp:1595] [PG ID 0 PG GUID 0(default_pg) Rank 5] Process group watchdog thread terminated with exception: [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=24, OpType=ALLREDUCE, NumelIn=512, NumelOut=512, Timeout(ms)=600000) ran for 600037 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f966eea1446 in /miniconda3/envs/glue-factory/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x282 (0x7f96701b4772 in /miniconda3/envs/glue-factory/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7f96701bbbb3 in /miniconda3/envs/glue-factory/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7f96701bd61d in /miniconda3/envs/glue-factory/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0x145c0 (0x7f96b8b4a5c0 in /miniconda3/envs/glue-factory/lib/python3.10/site-packages/torch/lib/libtorch.so)
frame #5: <unknown function> + 0x8609 (0x7f96bb498609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x43 (0x7f96bb261353 in /lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'
  what():    what():  [PG ID 0 PG GUID 0(default_pg) Rank 3] Process group watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=24, OpType=ALLREDUCE, NumelIn=512, NumelOut=512, Timeout(ms)=600000) ran for 600079 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7fd004ff2446 in /miniconda3/envs/glue-factory/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x282 (0x7fd006305772 in /miniconda3/envs/glue-factory/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7fd00630cbb3 in /miniconda3/envs/glue-factory/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7fd00630e61d in /miniconda3/envs/glue-factory/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0x145c0 (0x7fd04ec9b5c0 in /miniconda3/envs/glue-factory/lib/python3.10/site-packages/torch/lib/libtorch.so)
frame #5: <unknown function> + 0x8609 (0x7fd0515e9609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x43 (0x7fd0513b2353 in /lib/x86_64-linux-gnu/libc.so.6)

...

W0517 19:25:07.770000 2180234 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 2180349 closing signal SIGTERM
W0517 19:25:07.771000 2180234 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 2180350 closing signal SIGTERM
W0517 19:25:07.772000 2180234 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 2180351 closing signal SIGTERM
W0517 19:25:07.772000 2180234 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 2180352 closing signal SIGTERM
W0517 19:25:07.772000 2180234 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 2180354 closing signal SIGTERM
W0517 19:25:07.773000 2180234 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 2180355 closing signal SIGTERM
W0517 19:25:07.773000 2180234 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 2180356 closing signal SIGTERM
E0517 19:25:08.039000 2180234 site-packages/torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: -6) local_rank: 4 (pid: 2180353) of binary: /miniconda3/envs/glue-factory/bin/python3
Traceback (most recent call last):
  File "/miniconda3/envs/glue-factory/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/miniconda3/envs/glue-factory/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
    return f(*args, **kwargs)
  File "/miniconda3/envs/glue-factory/lib/python3.10/site-packages/torch/distributed/run.py", line 919, in main
    run(args)
  File "/miniconda3/envs/glue-factory/lib/python3.10/site-packages/torch/distributed/run.py", line 910, in run
    elastic_launch(
  File "/miniconda3/envs/glue-factory/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 138, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/miniconda3/envs/glue-factory/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
========================================================
gluefactory.train FAILED
--------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
--------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2025-05-17_19:25:07
  host      : host_node_addr
  rank      : 4 (local_rank: 4)
  exitcode  : -6 (pid: 2180353)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 2180353
========================================================

I have no idea why this happens. Could someone help resolve this issue?

Thanks!

Did you make sure the same number of batches is used on each rank as described here?

Thanks for replying, @ptrblck.
When I check it, it looks like each GPU is assigned the same amount of work.
Machine 1:

rank: 2, len(loader): 25000, data['view0']['image'].shape: torch.Size([1, 3, 384, 512])
rank: 0, len(loader): 25000, data['view0']['image'].shape: torch.Size([1, 3, 384, 512])
rank: 1, len(loader): 25000, data['view0']['image'].shape: torch.Size([1, 3, 384, 512])

Machine 2:

rank: 3, len(loader): 25000, data['view0']['image'].shape: torch.Size([1, 3, 384, 512])
rank: 4, len(loader): 25000, data['view0']['image'].shape: torch.Size([1, 3, 384, 512])
rank: 5, len(loader): 25000, data['view0']['image'].shape: torch.Size([1, 3, 384, 512])

The batch dimension is 1 on each GPU because the total batch size of 6 is split across the 6 GPUs on the two nodes.
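For reference, the check above is essentially just this print on every rank (simplified; loader and data are the training DataLoader and the current batch):

import torch.distributed as dist

# Simplified version of the check that produced the output above;
# `loader` is the per-rank DataLoader and `data` is the current batch.
rank = dist.get_rank()
print(f"rank: {rank}, len(loader): {len(loader)}, data['view0']['image'].shape: {data['view0']['image'].shape}")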

I also tested the same command by emulating two nodes on a single machine, and it works:
Terminal 1 on machine 1:
CUDA_VISIBLE_DEVICES=2,3,4 NCCL_SOCKET_IFNAME=eno1 NCCL_P2P_DISABLE=1 NO_ALBUMENTATIONS_UPDATE=1 torchrun --nnodes=2 --nproc_per_node=3 --rdzv_id=100 --rdzv_backend=c10d --rdzv_endpoint=host_node_addr:29400 -m gluefactory.train data.batch_size=6
Terminal 2 on machine 1:
CUDA_VISIBLE_DEVICES=5,6,7 NCCL_SOCKET_IFNAME=eno1 NCCL_P2P_DISABLE=1 NO_ALBUMENTATIONS_UPDATE=1 torchrun --nnodes=2 --nproc_per_node=3 --rdzv_id=100 --rdzv_backend=c10d --rdzv_endpoint=host_node_addr:29400 -m gluefactory.train data.batch_size=6

Hi @ptrblck, may I ask you for some advice on how to debug this problem further?