Hi all,
I am stuck on a process-hanging issue when using DDP. After a few tests, I found that training stops at loss.backward()
in the first iteration. The interesting point is that this issue only happens when running the code across 2 nodes with 3 or more GPUs on each node.
I have two machines on the same local network. Each machine has 8 GPUs with CUDA installed.
Machine 1:
$ nvidia-smi
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 555.42.06 Driver Version: 555.42.06 CUDA Version: 12.5 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA RTX A5000 Off | 00000000:4F:00.0 Off | Off |
| 33% 50C P8 22W / 230W | 18MiB / 24564MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA RTX A5000 Off | 00000000:52:00.0 Off | Off |
| 39% 55C P8 17W / 230W | 18MiB / 24564MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 2 NVIDIA RTX A5000 Off | 00000000:56:00.0 Off | Off |
| 30% 27C P8 18W / 230W | 18MiB / 24564MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 3 NVIDIA RTX A5000 Off | 00000000:57:00.0 Off | Off |
| 30% 28C P8 20W / 230W | 18MiB / 24564MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 4 NVIDIA RTX A5000 Off | 00000000:CE:00.0 Off | Off |
| 30% 27C P8 20W / 230W | 18MiB / 24564MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 5 NVIDIA RTX A5000 Off | 00000000:D1:00.0 Off | Off |
| 30% 27C P8 20W / 230W | 18MiB / 24564MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 6 NVIDIA RTX A5000 Off | 00000000:D5:00.0 Off | Off |
| 30% 29C P8 16W / 230W | 18MiB / 24564MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 7 NVIDIA RTX A5000 Off | 00000000:D6:00.0 Off | Off |
| 30% 29C P8 17W / 230W | 18MiB / 24564MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
Machine 2:
$ nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17 Driver Version: 525.105.17 CUDA Version: 12.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA A40 On | 00000000:01:00.0 Off | 0 |
| 0% 43C P8 23W / 300W | 0MiB / 46068MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 NVIDIA A40 On | 00000000:25:00.0 Off | 0 |
| 0% 47C P8 22W / 300W | 0MiB / 46068MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 2 NVIDIA A40 On | 00000000:41:00.0 Off | 0 |
| 0% 27C P8 23W / 300W | 0MiB / 46068MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 3 NVIDIA A40 On | 00000000:61:00.0 Off | 0 |
| 0% 27C P8 23W / 300W | 0MiB / 46068MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 4 NVIDIA A40 On | 00000000:81:00.0 Off | 0 |
| 0% 33C P8 31W / 300W | 0MiB / 46068MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 5 NVIDIA A40 On | 00000000:A1:00.0 Off | 0 |
| 0% 31C P8 24W / 300W | 0MiB / 46068MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 6 NVIDIA A40 On | 00000000:C1:00.0 Off | 0 |
| 0% 30C P8 23W / 300W | 0MiB / 46068MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 7 NVIDIA A40 On | 00000000:E1:00.0 Off | 0 |
| 0% 28C P8 23W / 300W | 0MiB / 46068MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
When I use this command:
NCCL_SOCKET_IFNAME=eno1 NCCL_P2P_DISABLE=1 NO_ALBUMENTATIONS_UPDATE=1 torchrun --nnodes=1 --nproc_per_node=8 --rdzv_id=100 --rdzv_backend=c10d --rdzv_endpoint=host_node_addr:29400 -m gluefactory.train batch_size=128
, where each GPU takes 16 of the 128 samples per mini-batch, it works well.
This command also works fine, verifying that the two nodes can communicate with each other:
NCCL_SOCKET_IFNAME=eno1 NCCL_P2P_DISABLE=1 NO_ALBUMENTATIONS_UPDATE=1 torchrun --nnodes=2 --nproc_per_node=2 --rdzv_id=100 --rdzv_backend=c10d --rdzv_endpoint=host_node_addr:29400 -m gluefactory.train batch_size=64
, where each GPU takes 16 samples per mini-batch.
However, this command does not work when --nproc_per_node > 2:
NCCL_SOCKET_IFNAME=eno1 NCCL_P2P_DISABLE=1 NO_ALBUMENTATIONS_UPDATE=1 torchrun --nnodes=2 --nproc_per_node=3 --rdzv_id=100 --rdzv_backend=c10d --rdzv_endpoint=host_node_addr:29400 -m gluefactory.train batch_size=6
, where each GPU takes 1 sample per mini-batch.
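For context, the collective that hangs is the gradient all-reduce that DDP launches inside loss.backward(). Below is a minimal standalone sketch of that pattern, launched with the same torchrun arguments; it is a simplified stand-in, not the actual glue-factory training code, and the script name, model, and shapes are just placeholders:

# minimal_ddp_repro.py -- simplified stand-in, not the glue-factory code
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets LOCAL_RANK, RANK, and WORLD_SIZE for every process
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl")

    # tiny model; DDP all-reduces its gradients across all ranks during backward()
    model = DDP(nn.Linear(16, 16).cuda(local_rank), device_ids=[local_rank])

    x = torch.randn(1, 16, device=f"cuda:{local_rank}")  # 1 sample per GPU, like the failing run
    loss = model(x).sum()
    loss.backward()  # the gradient ALLREDUCE that appears to time out is issued here
    torch.cuda.synchronize()
    print(f"rank {dist.get_rank()}: backward finished", flush=True)

    dist.destroy_process_group()

if __name__ == "__main__":
    main()

If a sketch like this also hangs with --nnodes=2 --nproc_per_node=3, that would point at the NCCL/network layer rather than the training code itself.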
The error log from the failing --nproc_per_node=3 run is as follows; the processes time out on an ALLREDUCE operation, i.e., the gradient synchronization performed during loss.backward():
[rank0]:[E517 19:25:06.209264982 ProcessGroupNCCL.cpp:616] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=24, OpType=ALLREDUCE, NumelIn=512, NumelOut=512, Timeout(ms)=600000) ran for 600060 milliseconds before timing out.
[rank0]:[E517 19:25:06.209801226 ProcessGroupNCCL.cpp:1785] [PG ID 0 PG GUID 0(default_pg) Rank 0] Exception (either an error or timeout) detected by watchdog at work: 24, last enqueued NCCL work: 24, last completed NCCL work: 23.
[rank4]:[E517 19:25:06.247265883 ProcessGroupNCCL.cpp:616] [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=24, OpType=ALLREDUCE, NumelIn=512, NumelOut=512, Timeout(ms)=600000) ran for 600098 milliseconds before timing out.
[rank4]:[E517 19:25:06.247685696 ProcessGroupNCCL.cpp:1785] [PG ID 0 PG GUID 0(default_pg) Rank 4] Exception (either an error or timeout) detected by watchdog at work: 24, last enqueued NCCL work: 24, last completed NCCL work: 23.
[rank7]:[E517 19:25:06.292129309 ProcessGroupNCCL.cpp:616] [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=24, OpType=ALLREDUCE, NumelIn=512, NumelOut=512, Timeout(ms)=600000) ran for 600015 milliseconds before timing out.
[rank7]:[E517 19:25:06.292596487 ProcessGroupNCCL.cpp:1785] [PG ID 0 PG GUID 0(default_pg) Rank 7] Exception (either an error or timeout) detected by watchdog at work: 24, last enqueued NCCL work: 24, last completed NCCL work: 23.
[rank5]:[E517 19:25:06.310787916 ProcessGroupNCCL.cpp:616] [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=24, OpType=ALLREDUCE, NumelIn=512, NumelOut=512, Timeout(ms)=600000) ran for 600037 milliseconds before timing out.
[rank5]:[E517 19:25:06.311238626 ProcessGroupNCCL.cpp:1785] [PG ID 0 PG GUID 0(default_pg) Rank 5] Exception (either an error or timeout) detected by watchdog at work: 24, last enqueued NCCL work: 24, last completed NCCL work: 23.
[rank6]:[E517 19:25:06.348387063 ProcessGroupNCCL.cpp:616] [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=24, OpType=ALLREDUCE, NumelIn=512, NumelOut=512, Timeout(ms)=600000) ran for 600016 milliseconds before timing out.
[rank6]:[E517 19:25:06.348858713 ProcessGroupNCCL.cpp:1785] [PG ID 0 PG GUID 0(default_pg) Rank 6] Exception (either an error or timeout) detected by watchdog at work: 24, last enqueued NCCL work: 24, last completed NCCL work: 23.
[rank2]:[E517 19:25:06.350021754 ProcessGroupNCCL.cpp:616] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=24, OpType=ALLREDUCE, NumelIn=512, NumelOut=512, Timeout(ms)=600000) ran for 600006 milliseconds before timing out.
[rank2]:[E517 19:25:06.350501739 ProcessGroupNCCL.cpp:1785] [PG ID 0 PG GUID 0(default_pg) Rank 2] Exception (either an error or timeout) detected by watchdog at work: 24, last enqueued NCCL work: 24, last completed NCCL work: 23.
[rank3]:[E517 19:25:06.356503284 ProcessGroupNCCL.cpp:616] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=24, OpType=ALLREDUCE, NumelIn=512, NumelOut=512, Timeout(ms)=600000) ran for 600079 milliseconds before timing out.
[rank3]:[E517 19:25:06.356935837 ProcessGroupNCCL.cpp:1785] [PG ID 0 PG GUID 0(default_pg) Rank 3] Exception (either an error or timeout) detected by watchdog at work: 24, last enqueued NCCL work: 24, last completed NCCL work: 23.
[rank1]:[E517 19:25:06.358857753 ProcessGroupNCCL.cpp:616] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=24, OpType=ALLREDUCE, NumelIn=512, NumelOut=512, Timeout(ms)=600000) ran for 600014 milliseconds before timing out.
[rank1]:[E517 19:25:06.359291018 ProcessGroupNCCL.cpp:1785] [PG ID 0 PG GUID 0(default_pg) Rank 1] Exception (either an error or timeout) detected by watchdog at work: 24, last enqueued NCCL work: 24, last completed NCCL work: 23.
[rank6]:[E517 19:25:07.646662561 ProcessGroupNCCL.cpp:1834] [PG ID 0 PG GUID 0(default_pg) Rank 6] Timeout at NCCL work: 24, last enqueued NCCL work: 24, last completed NCCL work: 23.
[rank6]:[E517 19:25:07.646680342 ProcessGroupNCCL.cpp:630] [Rank 6] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank6]:[E517 19:25:07.646684725 ProcessGroupNCCL.cpp:636] [Rank 6] To avoid data inconsistency, we are taking the entire process down.
[rank4]:[E517 19:25:07.647492384 ProcessGroupNCCL.cpp:1834] [PG ID 0 PG GUID 0(default_pg) Rank 4] Timeout at NCCL work: 24, last enqueued NCCL work: 24, last completed NCCL work: 23.
[rank4]:[E517 19:25:07.647508092 ProcessGroupNCCL.cpp:630] [Rank 4] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank4]:[E517 19:25:07.647512556 ProcessGroupNCCL.cpp:636] [Rank 4] To avoid data inconsistency, we are taking the entire process down.
[rank3]:[E517 19:25:07.647509443 ProcessGroupNCCL.cpp:1834] [PG ID 0 PG GUID 0(default_pg) Rank 3] Timeout at NCCL work: 24, last enqueued NCCL work: 24, last completed NCCL work: 23.
[rank3]:[E517 19:25:07.647525117 ProcessGroupNCCL.cpp:630] [Rank 3] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank3]:[E517 19:25:07.647531200 ProcessGroupNCCL.cpp:636] [Rank 3] To avoid data inconsistency, we are taking the entire process down.
[rank5]:[E517 19:25:07.647848914 ProcessGroupNCCL.cpp:1834] [PG ID 0 PG GUID 0(default_pg) Rank 5] Timeout at NCCL work: 24, last enqueued NCCL work: 24, last completed NCCL work: 23.
[rank5]:[E517 19:25:07.647865370 ProcessGroupNCCL.cpp:630] [Rank 5] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank5]:[E517 19:25:07.647871322 ProcessGroupNCCL.cpp:636] [Rank 5] To avoid data inconsistency, we are taking the entire process down.
[rank6]:[E517 19:25:07.647926445 ProcessGroupNCCL.cpp:1595] [PG ID 0 PG GUID 0(default_pg) Rank 6] Process group watchdog thread terminated with exception: [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=24, OpType=ALLREDUCE, NumelIn=512, NumelOut=512, Timeout(ms)=600000) ran for 600016 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f859a12e446 in /miniconda3/envs/glue-factory/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x282 (0x7f859b441772 in /miniconda3/envs/glue-factory/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7f859b448bb3 in /miniconda3/envs/glue-factory/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7f859b44a61d in /miniconda3/envs/glue-factory/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0x145c0 (0x7f85e3dd75c0 in /miniconda3/envs/glue-factory/lib/python3.10/site-packages/torch/lib/libtorch.so)
frame #5: <unknown function> + 0x8609 (0x7f85e6725609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x43 (0x7f85e64ee353 in /lib/x86_64-linux-gnu/libc.so.6)
terminate called after throwing an instance of 'c10::DistBackendError'
what(): [PG ID 0 PG GUID 0(default_pg) Rank 6] Process group watchdog thread terminated with exception: [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=24, OpType=ALLREDUCE, NumelIn=512, NumelOut=512, Timeout(ms)=600000) ran for 600016 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f859a12e446 in /miniconda3/envs/glue-factory/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x282 (0x7f859b441772 in /miniconda3/envs/glue-factory/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7f859b448bb3 in /miniconda3/envs/glue-factory/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7f859b44a61d in /miniconda3/envs/glue-factory/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0x145c0 (0x7f85e3dd75c0 in /miniconda3/envs/glue-factory/lib/python3.10/site-packages/torch/lib/libtorch.so)
frame #5: <unknown function> + 0x8609 (0x7f85e6725609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x43 (0x7f85e64ee353 in /lib/x86_64-linux-gnu/libc.so.6)
Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f859a12e446 in /miniconda3/envs/glue-factory/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xe4271b (0x7f859b0b771b in /miniconda3/envs/glue-factory/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0x145c0 (0x7f85e3dd75c0 in /miniconda3/envs/glue-factory/lib/python3.10/site-packages/torch/lib/libtorch.so)
frame #3: <unknown function> + 0x8609 (0x7f85e6725609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #4: clone + 0x43 (0x7f85e64ee353 in /lib/x86_64-linux-gnu/libc.so.6)
[rank4]:[E517 19:25:07.648765999 ProcessGroupNCCL.cpp:1595] [PG ID 0 PG GUID 0(default_pg) Rank 4] Process group watchdog thread terminated with exception: [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=24, OpType=ALLREDUCE, NumelIn=512, NumelOut=512, Timeout(ms)=600000) ran for 600098 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f5185aab446 in /miniconda3/envs/glue-factory/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x282 (0x7f5186dbe772 in /miniconda3/envs/glue-factory/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7f5186dc5bb3 in /miniconda3/envs/glue-factory/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7f5186dc761d in /miniconda3/envs/glue-factory/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0x145c0 (0x7f51cf7545c0 in /miniconda3/envs/glue-factory/lib/python3.10/site-packages/torch/lib/libtorch.so)
frame #5: <unknown function> + 0x8609 (0x7f51d20a2609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x43 (0x7f51d1e6b353 in /lib/x86_64-linux-gnu/libc.so.6)
[rank3]:[E517 19:25:07.648766792 ProcessGroupNCCL.cpp:1595] [PG ID 0 PG GUID 0(default_pg) Rank 3] Process group watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=24, OpType=ALLREDUCE, NumelIn=512, NumelOut=512, Timeout(ms)=600000) ran for 600079 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7fd004ff2446 in /miniconda3/envs/glue-factory/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x282 (0x7fd006305772 in /miniconda3/envs/glue-factory/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7fd00630cbb3 in /miniconda3/envs/glue-factory/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7fd00630e61d in /miniconda3/envs/glue-factory/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0x145c0 (0x7fd04ec9b5c0 in /miniconda3/envs/glue-factory/lib/python3.10/site-packages/torch/lib/libtorch.so)
frame #5: <unknown function> + 0x8609 (0x7fd0515e9609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x43 (0x7fd0513b2353 in /lib/x86_64-linux-gnu/libc.so.6)
terminate called after throwing an instance of 'c10::DistBackendError'
terminate called after throwing an instance of 'c10::DistBackendError'
[rank5]:[E517 19:25:07.649093588 ProcessGroupNCCL.cpp:1595] [PG ID 0 PG GUID 0(default_pg) Rank 5] Process group watchdog thread terminated with exception: [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=24, OpType=ALLREDUCE, NumelIn=512, NumelOut=512, Timeout(ms)=600000) ran for 600037 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f966eea1446 in /miniconda3/envs/glue-factory/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x282 (0x7f96701b4772 in /miniconda3/envs/glue-factory/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7f96701bbbb3 in /miniconda3/envs/glue-factory/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7f96701bd61d in /miniconda3/envs/glue-factory/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0x145c0 (0x7f96b8b4a5c0 in /miniconda3/envs/glue-factory/lib/python3.10/site-packages/torch/lib/libtorch.so)
frame #5: <unknown function> + 0x8609 (0x7f96bb498609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x43 (0x7f96bb261353 in /lib/x86_64-linux-gnu/libc.so.6)
terminate called after throwing an instance of 'c10::DistBackendError'
what(): what(): [PG ID 0 PG GUID 0(default_pg) Rank 3] Process group watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=24, OpType=ALLREDUCE, NumelIn=512, NumelOut=512, Timeout(ms)=600000) ran for 600079 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7fd004ff2446 in /miniconda3/envs/glue-factory/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x282 (0x7fd006305772 in /miniconda3/envs/glue-factory/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7fd00630cbb3 in /miniconda3/envs/glue-factory/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7fd00630e61d in /miniconda3/envs/glue-factory/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0x145c0 (0x7fd04ec9b5c0 in /miniconda3/envs/glue-factory/lib/python3.10/site-packages/torch/lib/libtorch.so)
frame #5: <unknown function> + 0x8609 (0x7fd0515e9609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x43 (0x7fd0513b2353 in /lib/x86_64-linux-gnu/libc.so.6)
...
W0517 19:25:07.770000 2180234 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 2180349 closing signal SIGTERM
W0517 19:25:07.771000 2180234 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 2180350 closing signal SIGTERM
W0517 19:25:07.772000 2180234 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 2180351 closing signal SIGTERM
W0517 19:25:07.772000 2180234 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 2180352 closing signal SIGTERM
W0517 19:25:07.772000 2180234 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 2180354 closing signal SIGTERM
W0517 19:25:07.773000 2180234 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 2180355 closing signal SIGTERM
W0517 19:25:07.773000 2180234 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 2180356 closing signal SIGTERM
E0517 19:25:08.039000 2180234 site-packages/torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: -6) local_rank: 4 (pid: 2180353) of binary: /miniconda3/envs/glue-factory/bin/python3
Traceback (most recent call last):
File "/miniconda3/envs/glue-factory/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/miniconda3/envs/glue-factory/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
return f(*args, **kwargs)
File "/miniconda3/envs/glue-factory/lib/python3.10/site-packages/torch/distributed/run.py", line 919, in main
run(args)
File "/miniconda3/envs/glue-factory/lib/python3.10/site-packages/torch/distributed/run.py", line 910, in run
elastic_launch(
File "/miniconda3/envs/glue-factory/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 138, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/miniconda3/envs/glue-factory/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
========================================================
gluefactory.train FAILED
--------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
--------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2025-05-17_19:25:07
host : host_node_addr
rank : 4 (local_rank: 4)
exitcode : -6 (pid: 2180353)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 2180353
========================================================
I have no idea why this happens. Could someone help resolve this issue?
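If it helps, I can rerun the failing command with NCCL_DEBUG=INFO and TORCH_DISTRIBUTED_DEBUG=DETAIL to collect more detailed NCCL logs and post them here.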
Thanks!