Training hangs on loss.backward() with DDP --nnodes=2 --nproc_per_node=3

Hi all,

I am stuck on a hanging issue when using DDP. After a few tests, I found that the processes stop at loss.backward() in the first iteration. The interesting point is that this only happens when running across 2 nodes with 3 or more GPUs per node.
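For context, the training step follows the standard DDP pattern; below is a simplified sketch of where the hang occurs (build_model, compute_loss, loader, and optimizer are placeholders, not the actual glue-factory code):

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Simplified sketch; build_model, compute_loss, loader, and optimizer are placeholders.
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)
model = DDP(build_model().to(local_rank), device_ids=[local_rank])

for data in loader:
    loss = compute_loss(model(data), data)
    loss.backward()   # hangs here in the first iteration (gradient all-reduce over NCCL)
    optimizer.step()
    optimizer.zero_grad()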

I have two machines on the same local network. Each machine has 8 GPUs and CUDA installed.
Machine 1:

$ nvidia-smi
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 555.42.06              Driver Version: 555.42.06      CUDA Version: 12.5     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA RTX A5000               Off |   00000000:4F:00.0 Off |                  Off |
| 33%   50C    P8             22W /  230W |      18MiB /  24564MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA RTX A5000               Off |   00000000:52:00.0 Off |                  Off |
| 39%   55C    P8             17W /  230W |      18MiB /  24564MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA RTX A5000               Off |   00000000:56:00.0 Off |                  Off |
| 30%   27C    P8             18W /  230W |      18MiB /  24564MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA RTX A5000               Off |   00000000:57:00.0 Off |                  Off |
| 30%   28C    P8             20W /  230W |      18MiB /  24564MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   4  NVIDIA RTX A5000               Off |   00000000:CE:00.0 Off |                  Off |
| 30%   27C    P8             20W /  230W |      18MiB /  24564MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   5  NVIDIA RTX A5000               Off |   00000000:D1:00.0 Off |                  Off |
| 30%   27C    P8             20W /  230W |      18MiB /  24564MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   6  NVIDIA RTX A5000               Off |   00000000:D5:00.0 Off |                  Off |
| 30%   29C    P8             16W /  230W |      18MiB /  24564MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   7  NVIDIA RTX A5000               Off |   00000000:D6:00.0 Off |                  Off |
| 30%   29C    P8             17W /  230W |      18MiB /  24564MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

Machine 2:

$ nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17   Driver Version: 525.105.17   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A40          On   | 00000000:01:00.0 Off |                    0 |
|  0%   43C    P8    23W / 300W |      0MiB / 46068MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A40          On   | 00000000:25:00.0 Off |                    0 |
|  0%   47C    P8    22W / 300W |      0MiB / 46068MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA A40          On   | 00000000:41:00.0 Off |                    0 |
|  0%   27C    P8    23W / 300W |      0MiB / 46068MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA A40          On   | 00000000:61:00.0 Off |                    0 |
|  0%   27C    P8    23W / 300W |      0MiB / 46068MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   4  NVIDIA A40          On   | 00000000:81:00.0 Off |                    0 |
|  0%   33C    P8    31W / 300W |      0MiB / 46068MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   5  NVIDIA A40          On   | 00000000:A1:00.0 Off |                    0 |
|  0%   31C    P8    24W / 300W |      0MiB / 46068MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   6  NVIDIA A40          On   | 00000000:C1:00.0 Off |                    0 |
|  0%   30C    P8    23W / 300W |      0MiB / 46068MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   7  NVIDIA A40          On   | 00000000:E1:00.0 Off |                    0 |
|  0%   28C    P8    23W / 300W |      0MiB / 46068MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

This command, where each GPU takes 16 of the 128 samples per mini-batch, works well:
NCCL_SOCKET_IFNAME=eno1 NCCL_P2P_DISABLE=1 NO_ALBUMENTATIONS_UPDATE=1 torchrun --nnodes=1 --nproc_per_node=8 --rdzv_id=100 --rdzv_backend=c10d --rdzv_endpoint=host_node_addr:29400 -m gluefactory.train batch_size=128

This command, where each GPU takes 16 samples per mini-batch, also works fine and confirms that the two nodes can communicate:
NCCL_SOCKET_IFNAME=eno1 NCCL_P2P_DISABLE=1 NO_ALBUMENTATIONS_UPDATE=1 torchrun --nnodes=2 --nproc_per_node=2 --rdzv_id=100 --rdzv_backend=c10d --rdzv_endpoint=host_node_addr:29400 -m gluefactory.train batch_size=64

However, the command fails whenever --nproc_per_node > 2; here each GPU takes 1 sample per mini-batch:
NCCL_SOCKET_IFNAME=eno1 NCCL_P2P_DISABLE=1 NO_ALBUMENTATIONS_UPDATE=1 torchrun --nnodes=2 --nproc_per_node=3 --rdzv_id=100 --rdzv_backend=c10d --rdzv_endpoint=host_node_addr:29400 -m gluefactory.train batch_size=6
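Just to make the per-GPU batch arithmetic explicit for the three runs above:

# World size and per-GPU batch for the three runs (nnodes x nproc_per_node);
# only the last configuration hangs.
for nnodes, nproc, batch in [(1, 8, 128), (2, 2, 64), (2, 3, 6)]:
    world_size = nnodes * nproc
    print(f"{nnodes} node(s) x {nproc} proc(s): world_size={world_size}, per-GPU batch={batch // world_size}")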

The error log is below; it shows that the processes timed out on an ALLREDUCE operation, i.e., the gradient all-reduce triggered by loss.backward():

[rank0]:[E517 19:25:06.209264982 ProcessGroupNCCL.cpp:616] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=24, OpType=ALLREDUCE, NumelIn=512, NumelOut=512, Timeout(ms)=600000) ran for 600060 milliseconds before timing out.
[rank0]:[E517 19:25:06.209801226 ProcessGroupNCCL.cpp:1785] [PG ID 0 PG GUID 0(default_pg) Rank 0] Exception (either an error or timeout) detected by watchdog at work: 24, last enqueued NCCL work: 24, last completed NCCL work: 23.
[rank4]:[E517 19:25:06.247265883 ProcessGroupNCCL.cpp:616] [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=24, OpType=ALLREDUCE, NumelIn=512, NumelOut=512, Timeout(ms)=600000) ran for 600098 milliseconds before timing out.
[rank4]:[E517 19:25:06.247685696 ProcessGroupNCCL.cpp:1785] [PG ID 0 PG GUID 0(default_pg) Rank 4] Exception (either an error or timeout) detected by watchdog at work: 24, last enqueued NCCL work: 24, last completed NCCL work: 23.
[rank7]:[E517 19:25:06.292129309 ProcessGroupNCCL.cpp:616] [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=24, OpType=ALLREDUCE, NumelIn=512, NumelOut=512, Timeout(ms)=600000) ran for 600015 milliseconds before timing out.
[rank7]:[E517 19:25:06.292596487 ProcessGroupNCCL.cpp:1785] [PG ID 0 PG GUID 0(default_pg) Rank 7] Exception (either an error or timeout) detected by watchdog at work: 24, last enqueued NCCL work: 24, last completed NCCL work: 23.
[rank5]:[E517 19:25:06.310787916 ProcessGroupNCCL.cpp:616] [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=24, OpType=ALLREDUCE, NumelIn=512, NumelOut=512, Timeout(ms)=600000) ran for 600037 milliseconds before timing out.
[rank5]:[E517 19:25:06.311238626 ProcessGroupNCCL.cpp:1785] [PG ID 0 PG GUID 0(default_pg) Rank 5] Exception (either an error or timeout) detected by watchdog at work: 24, last enqueued NCCL work: 24, last completed NCCL work: 23.
[rank6]:[E517 19:25:06.348387063 ProcessGroupNCCL.cpp:616] [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=24, OpType=ALLREDUCE, NumelIn=512, NumelOut=512, Timeout(ms)=600000) ran for 600016 milliseconds before timing out.
[rank6]:[E517 19:25:06.348858713 ProcessGroupNCCL.cpp:1785] [PG ID 0 PG GUID 0(default_pg) Rank 6] Exception (either an error or timeout) detected by watchdog at work: 24, last enqueued NCCL work: 24, last completed NCCL work: 23.
[rank2]:[E517 19:25:06.350021754 ProcessGroupNCCL.cpp:616] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=24, OpType=ALLREDUCE, NumelIn=512, NumelOut=512, Timeout(ms)=600000) ran for 600006 milliseconds before timing out.
[rank2]:[E517 19:25:06.350501739 ProcessGroupNCCL.cpp:1785] [PG ID 0 PG GUID 0(default_pg) Rank 2] Exception (either an error or timeout) detected by watchdog at work: 24, last enqueued NCCL work: 24, last completed NCCL work: 23.
[rank3]:[E517 19:25:06.356503284 ProcessGroupNCCL.cpp:616] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=24, OpType=ALLREDUCE, NumelIn=512, NumelOut=512, Timeout(ms)=600000) ran for 600079 milliseconds before timing out.
[rank3]:[E517 19:25:06.356935837 ProcessGroupNCCL.cpp:1785] [PG ID 0 PG GUID 0(default_pg) Rank 3] Exception (either an error or timeout) detected by watchdog at work: 24, last enqueued NCCL work: 24, last completed NCCL work: 23.
[rank1]:[E517 19:25:06.358857753 ProcessGroupNCCL.cpp:616] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=24, OpType=ALLREDUCE, NumelIn=512, NumelOut=512, Timeout(ms)=600000) ran for 600014 milliseconds before timing out.
[rank1]:[E517 19:25:06.359291018 ProcessGroupNCCL.cpp:1785] [PG ID 0 PG GUID 0(default_pg) Rank 1] Exception (either an error or timeout) detected by watchdog at work: 24, last enqueued NCCL work: 24, last completed NCCL work: 23.
[rank6]:[E517 19:25:07.646662561 ProcessGroupNCCL.cpp:1834] [PG ID 0 PG GUID 0(default_pg) Rank 6] Timeout at NCCL work: 24, last enqueued NCCL work: 24, last completed NCCL work: 23.
[rank6]:[E517 19:25:07.646680342 ProcessGroupNCCL.cpp:630] [Rank 6] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank6]:[E517 19:25:07.646684725 ProcessGroupNCCL.cpp:636] [Rank 6] To avoid data inconsistency, we are taking the entire process down.
[rank4]:[E517 19:25:07.647492384 ProcessGroupNCCL.cpp:1834] [PG ID 0 PG GUID 0(default_pg) Rank 4] Timeout at NCCL work: 24, last enqueued NCCL work: 24, last completed NCCL work: 23.
[rank4]:[E517 19:25:07.647508092 ProcessGroupNCCL.cpp:630] [Rank 4] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank4]:[E517 19:25:07.647512556 ProcessGroupNCCL.cpp:636] [Rank 4] To avoid data inconsistency, we are taking the entire process down.
[rank3]:[E517 19:25:07.647509443 ProcessGroupNCCL.cpp:1834] [PG ID 0 PG GUID 0(default_pg) Rank 3] Timeout at NCCL work: 24, last enqueued NCCL work: 24, last completed NCCL work: 23.
[rank3]:[E517 19:25:07.647525117 ProcessGroupNCCL.cpp:630] [Rank 3] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank3]:[E517 19:25:07.647531200 ProcessGroupNCCL.cpp:636] [Rank 3] To avoid data inconsistency, we are taking the entire process down.
[rank5]:[E517 19:25:07.647848914 ProcessGroupNCCL.cpp:1834] [PG ID 0 PG GUID 0(default_pg) Rank 5] Timeout at NCCL work: 24, last enqueued NCCL work: 24, last completed NCCL work: 23.
[rank5]:[E517 19:25:07.647865370 ProcessGroupNCCL.cpp:630] [Rank 5] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank5]:[E517 19:25:07.647871322 ProcessGroupNCCL.cpp:636] [Rank 5] To avoid data inconsistency, we are taking the entire process down.
[rank6]:[E517 19:25:07.647926445 ProcessGroupNCCL.cpp:1595] [PG ID 0 PG GUID 0(default_pg) Rank 6] Process group watchdog thread terminated with exception: [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=24, OpType=ALLREDUCE, NumelIn=512, NumelOut=512, Timeout(ms)=600000) ran for 600016 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f859a12e446 in /miniconda3/envs/glue-factory/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x282 (0x7f859b441772 in /miniconda3/envs/glue-factory/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7f859b448bb3 in /miniconda3/envs/glue-factory/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7f859b44a61d in /miniconda3/envs/glue-factory/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0x145c0 (0x7f85e3dd75c0 in /miniconda3/envs/glue-factory/lib/python3.10/site-packages/torch/lib/libtorch.so)
frame #5: <unknown function> + 0x8609 (0x7f85e6725609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x43 (0x7f85e64ee353 in /lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'
  what():  [PG ID 0 PG GUID 0(default_pg) Rank 6] Process group watchdog thread terminated with exception: [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=24, OpType=ALLREDUCE, NumelIn=512, NumelOut=512, Timeout(ms)=600000) ran for 600016 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f859a12e446 in /miniconda3/envs/glue-factory/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x282 (0x7f859b441772 in /miniconda3/envs/glue-factory/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7f859b448bb3 in /miniconda3/envs/glue-factory/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7f859b44a61d in /miniconda3/envs/glue-factory/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0x145c0 (0x7f85e3dd75c0 in /miniconda3/envs/glue-factory/lib/python3.10/site-packages/torch/lib/libtorch.so)
frame #5: <unknown function> + 0x8609 (0x7f85e6725609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x43 (0x7f85e64ee353 in /lib/x86_64-linux-gnu/libc.so.6)

Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f859a12e446 in /miniconda3/envs/glue-factory/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xe4271b (0x7f859b0b771b in /miniconda3/envs/glue-factory/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0x145c0 (0x7f85e3dd75c0 in /miniconda3/envs/glue-factory/lib/python3.10/site-packages/torch/lib/libtorch.so)
frame #3: <unknown function> + 0x8609 (0x7f85e6725609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #4: clone + 0x43 (0x7f85e64ee353 in /lib/x86_64-linux-gnu/libc.so.6)

[rank4]:[E517 19:25:07.648765999 ProcessGroupNCCL.cpp:1595] [PG ID 0 PG GUID 0(default_pg) Rank 4] Process group watchdog thread terminated with exception: [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=24, OpType=ALLREDUCE, NumelIn=512, NumelOut=512, Timeout(ms)=600000) ran for 600098 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f5185aab446 in /miniconda3/envs/glue-factory/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x282 (0x7f5186dbe772 in /miniconda3/envs/glue-factory/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7f5186dc5bb3 in /miniconda3/envs/glue-factory/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7f5186dc761d in /miniconda3/envs/glue-factory/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0x145c0 (0x7f51cf7545c0 in /miniconda3/envs/glue-factory/lib/python3.10/site-packages/torch/lib/libtorch.so)
frame #5: <unknown function> + 0x8609 (0x7f51d20a2609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x43 (0x7f51d1e6b353 in /lib/x86_64-linux-gnu/libc.so.6)

[rank3]:[E517 19:25:07.648766792 ProcessGroupNCCL.cpp:1595] [PG ID 0 PG GUID 0(default_pg) Rank 3] Process group watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=24, OpType=ALLREDUCE, NumelIn=512, NumelOut=512, Timeout(ms)=600000) ran for 600079 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7fd004ff2446 in /miniconda3/envs/glue-factory/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x282 (0x7fd006305772 in /miniconda3/envs/glue-factory/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7fd00630cbb3 in /miniconda3/envs/glue-factory/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7fd00630e61d in /miniconda3/envs/glue-factory/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0x145c0 (0x7fd04ec9b5c0 in /miniconda3/envs/glue-factory/lib/python3.10/site-packages/torch/lib/libtorch.so)
frame #5: <unknown function> + 0x8609 (0x7fd0515e9609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x43 (0x7fd0513b2353 in /lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'
terminate called after throwing an instance of 'c10::DistBackendError'
[rank5]:[E517 19:25:07.649093588 ProcessGroupNCCL.cpp:1595] [PG ID 0 PG GUID 0(default_pg) Rank 5] Process group watchdog thread terminated with exception: [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=24, OpType=ALLREDUCE, NumelIn=512, NumelOut=512, Timeout(ms)=600000) ran for 600037 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f966eea1446 in /miniconda3/envs/glue-factory/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x282 (0x7f96701b4772 in /miniconda3/envs/glue-factory/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7f96701bbbb3 in /miniconda3/envs/glue-factory/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7f96701bd61d in /miniconda3/envs/glue-factory/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0x145c0 (0x7f96b8b4a5c0 in /miniconda3/envs/glue-factory/lib/python3.10/site-packages/torch/lib/libtorch.so)
frame #5: <unknown function> + 0x8609 (0x7f96bb498609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x43 (0x7f96bb261353 in /lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'
  what():    what():  [PG ID 0 PG GUID 0(default_pg) Rank 3] Process group watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=24, OpType=ALLREDUCE, NumelIn=512, NumelOut=512, Timeout(ms)=600000) ran for 600079 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7fd004ff2446 in /miniconda3/envs/glue-factory/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x282 (0x7fd006305772 in /miniconda3/envs/glue-factory/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7fd00630cbb3 in /miniconda3/envs/glue-factory/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7fd00630e61d in /miniconda3/envs/glue-factory/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0x145c0 (0x7fd04ec9b5c0 in /miniconda3/envs/glue-factory/lib/python3.10/site-packages/torch/lib/libtorch.so)
frame #5: <unknown function> + 0x8609 (0x7fd0515e9609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x43 (0x7fd0513b2353 in /lib/x86_64-linux-gnu/libc.so.6)

...

W0517 19:25:07.770000 2180234 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 2180349 closing signal SIGTERM
W0517 19:25:07.771000 2180234 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 2180350 closing signal SIGTERM
W0517 19:25:07.772000 2180234 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 2180351 closing signal SIGTERM
W0517 19:25:07.772000 2180234 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 2180352 closing signal SIGTERM
W0517 19:25:07.772000 2180234 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 2180354 closing signal SIGTERM
W0517 19:25:07.773000 2180234 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 2180355 closing signal SIGTERM
W0517 19:25:07.773000 2180234 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 2180356 closing signal SIGTERM
E0517 19:25:08.039000 2180234 site-packages/torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: -6) local_rank: 4 (pid: 2180353) of binary: /miniconda3/envs/glue-factory/bin/python3
Traceback (most recent call last):
  File "/miniconda3/envs/glue-factory/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/miniconda3/envs/glue-factory/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
    return f(*args, **kwargs)
  File "/miniconda3/envs/glue-factory/lib/python3.10/site-packages/torch/distributed/run.py", line 919, in main
    run(args)
  File "/miniconda3/envs/glue-factory/lib/python3.10/site-packages/torch/distributed/run.py", line 910, in run
    elastic_launch(
  File "/miniconda3/envs/glue-factory/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 138, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/miniconda3/envs/glue-factory/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
========================================================
gluefactory.train FAILED
--------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
--------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2025-05-17_19:25:07
  host      : host_node_addr
  rank      : 4 (local_rank: 4)
  exitcode  : -6 (pid: 2180353)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 2180353
========================================================

I have no idea why this happens. Could someone help resolve this issue?

Thanks!

Did you make sure the same number of batches is used on each rank as described here?

Thanks for replying, @ptrblck.
When I check it, it looks like each GPU is assigned the same amount of work.
Machine 1:

rank: 2, len(loader): 25000, data['view0']['image'].shape: torch.Size([1, 3, 384, 512])
rank: 0, len(loader): 25000, data['view0']['image'].shape: torch.Size([1, 3, 384, 512])
rank: 1, len(loader): 25000, data['view0']['image'].shape: torch.Size([1, 3, 384, 512])

Machine 2:

rank: 3, len(loader): 25000, data['view0']['image'].shape: torch.Size([1, 3, 384, 512])
rank: 4, len(loader): 25000, data['view0']['image'].shape: torch.Size([1, 3, 384, 512])
rank: 5, len(loader): 25000, data['view0']['image'].shape: torch.Size([1, 3, 384, 512])

The batch dimension is 1 on each GPU because the total batch size of 6 is split across the 6 GPUs on the two nodes.
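For reference, the check above is essentially just this print on every rank (simplified; loader and data are the training DataLoader and the current batch):

import torch.distributed as dist

# Simplified version of the check that produced the output above;
# `loader` is the per-rank DataLoader and `data` is the current batch.
rank = dist.get_rank()
print(f"rank: {rank}, len(loader): {len(loader)}, data['view0']['image'].shape: {data['view0']['image'].shape}")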

I also tested the same command by emulating two nodes on a single machine, and it works:
Terminal 1 on machine 1:
CUDA_VISIBLE_DEVICES=2,3,4 NCCL_SOCKET_IFNAME=eno1 NCCL_P2P_DISABLE=1 NO_ALBUMENTATIONS_UPDATE=1 torchrun --nnodes=2 --nproc_per_node=3 --rdzv_id=100 --rdzv_backend=c10d --rdzv_endpoint=host_node_addr:29400 -m gluefactory.train data.batch_size=6
Terminal 2 on machine 1:
CUDA_VISIBLE_DEVICES=5,6,7 NCCL_SOCKET_IFNAME=eno1 NCCL_P2P_DISABLE=1 NO_ALBUMENTATIONS_UPDATE=1 torchrun --nnodes=2 --nproc_per_node=3 --rdzv_id=100 --rdzv_backend=c10d --rdzv_endpoint=host_node_addr:29400 -m gluefactory.train data.batch_size=6

Hi @ptrblck, may I ask you for some advice on how to debug this problem further?