Hi! I'm training my model with PyTorch Lightning and wandb. Training ran fine for many epochs (20+), but then the progress bar stopped moving and the run was killed with this error:
[rank3]:[E ProcessGroupNCCL.cpp:1316] [PG 0 Rank 3] Heartbeat monitor timed out! Process will be terminated after dumping debug info. workMetaList_.size()=5
[rank3]:[E ProcessGroupNCCL.cpp:1153] [PG 0 Rank 3] ProcessGroupNCCL preparing to dump debug info.
[rank3]:[F ProcessGroupNCCL.cpp:1169] [PG 0 Rank 3] [PG 0 Rank 3] ProcessGroupNCCL's watchdog got stuck for 600 seconds without making progress in monitoring enqueued collectives. This typically indicates a NCCL/CUDA API hang blocking the watchdog, and could be triggered by another thread holding the GIL inside a CUDA api, or other deadlock-prone behaviors.If you suspect the watchdog is not actually stuck and a longer timeout would help, you can either increase the timeout (TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC) to a larger value or disable the heartbeat monitor (TORCH_NCCL_ENABLE_MONITORING=0).If either of aforementioned helps, feel free to file an issue to PyTorch about the short timeout or false positive abort; otherwise, please attempt to debug the hang. workMetaList_.size() = 5
[rank: 3] Child process with PID 70148 terminated with code -6. Forcefully terminating all other processes to avoid zombies
Killed
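For reference, the variables suggested at the end of that message can also be set from inside the training script, as long as that happens before the NCCL process group is created (a minimal sketch; the variable names are taken verbatim from the error above, the values are just examples):

```python
import os

# Must be in the environment before the NCCL process group is created
# (so in practice, at the very top of the training script), otherwise
# the watchdog never sees them.
os.environ["TORCH_NCCL_ENABLE_MONITORING"] = "0"         # disable the heartbeat monitor
os.environ["TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC"] = "1200"  # or raise its timeout instead
```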
In my case I put export TORCH_NCCL_ENABLE_MONITORING=0 in my .bashrc instead, but I still got errors:
Epoch 55: 80%|▊| 4/5 [01:32<00:23, 0.04it/s, v_num=un96, train/loss_simple_step=0.887, train/loss_vlb_step=0.887, train/loss_step=0.887, global_step=278.0, lr_abs=1.39e-5, train/loss_simple_epoch=0.979, train/loss_vlb_epoch=0.979, train/loss_epoch=0.9
[rank3]:[E ProcessGroupNCCL.cpp:563] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=10326, OpType=BROADCAST, NumelIn=67844042, NumelOut=67844042, Timeout(ms)=1800000) ran for 1800048 milliseconds before timing out.
[rank3]:[E ProcessGroupNCCL.cpp:1537] [PG 0 Rank 3] Timeout at NCCL work: 10326, last enqueued NCCL work: 10330, last completed NCCL work: 10325.
[rank3]:[E ProcessGroupNCCL.cpp:577] [Rank 3] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank3]:[E ProcessGroupNCCL.cpp:583] [Rank 3] To avoid data inconsistency, we are taking the entire process down.
[rank3]:[E ProcessGroupNCCL.cpp:1414] [PG 0 Rank 3] Process group watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=10326, OpType=BROADCAST, NumelIn=67844042, NumelOut=67844042, Timeout(ms)=1800000) ran for 1800048 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f61f93ed897 in /home/szding/.conda/envs/v2sa/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f61fa6c6c62 in /home/szding/.conda/envs/v2sa/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f61fa6cba80 in /home/szding/.conda/envs/v2sa/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f61fa6ccdcc in /home/szding/.conda/envs/v2sa/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xd6df4 (0x7f6246173df4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #5: <unknown function> + 0x8609 (0x7f62484ff609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x43 (0x7f62482ca353 in /lib/x86_64-linux-gnu/libc.so.6)
terminate called after throwing an instance of 'c10::DistBackendError'
what(): [PG 0 Rank 3] Process group watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=10326, OpType=BROADCAST, NumelIn=67844042, NumelOut=67844042, Timeout(ms)=1800000) ran for 1800048 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f61f93ed897 in /home/szding/.conda/envs/v2sa/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f61fa6c6c62 in /home/szding/.conda/envs/v2sa/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f61fa6cba80 in /home/szding/.conda/envs/v2sa/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f61fa6ccdcc in /home/szding/.conda/envs/v2sa/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xd6df4 (0x7f6246173df4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #5: <unknown function> + 0x8609 (0x7f62484ff609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x43 (0x7f62482ca353 in /lib/x86_64-linux-gnu/libc.so.6)
Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f61f93ed897 in /home/szding/.conda/envs/v2sa/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xe32119 (0x7f61fa350119 in /home/szding/.conda/envs/v2sa/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0xd6df4 (0x7f6246173df4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #3: <unknown function> + 0x8609 (0x7f62484ff609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #4: clone + 0x43 (0x7f62482ca353 in /lib/x86_64-linux-gnu/libc.so.6)
[rank2]:[E ProcessGroupNCCL.cpp:563] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=10326, OpType=BROADCAST, NumelIn=67844042, NumelOut=67844042, Timeout(ms)=1800000) ran for 1800081 milliseconds before timing out.
[rank2]:[E ProcessGroupNCCL.cpp:1537] [PG 0 Rank 2] Timeout at NCCL work: 10326, last enqueued NCCL work: 10330, last completed NCCL work: 10325.
[rank2]:[E ProcessGroupNCCL.cpp:577] [Rank 2] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank2]:[E ProcessGroupNCCL.cpp:583] [Rank 2] To avoid data inconsistency, we are taking the entire process down.
[rank2]:[E ProcessGroupNCCL.cpp:1414] [PG 0 Rank 2] Process group watchdog thread terminated with exception: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=10326, OpType=BROADCAST, NumelIn=67844042, NumelOut=67844042, Timeout(ms)=1800000) ran for 1800081 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f9f58ff4897 in /home/szding/.conda/envs/v2sa/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f9f5a2cdc62 in /home/szding/.conda/envs/v2sa/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f9f5a2d2a80 in /home/szding/.conda/envs/v2sa/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f9f5a2d3dcc in /home/szding/.conda/envs/v2sa/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xd6df4 (0x7f9fa5d7adf4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #5: <unknown function> + 0x8609 (0x7f9fa8104609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x43 (0x7f9fa7ecf353 in /lib/x86_64-linux-gnu/libc.so.6)
terminate called after throwing an instance of 'c10::DistBackendError'
what(): [PG 0 Rank 2] Process group watchdog thread terminated with exception: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=10326, OpType=BROADCAST, NumelIn=67844042, NumelOut=67844042, Timeout(ms)=1800000) ran for 1800081 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f9f58ff4897 in /home/szding/.conda/envs/v2sa/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f9f5a2cdc62 in /home/szding/.conda/envs/v2sa/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f9f5a2d2a80 in /home/szding/.conda/envs/v2sa/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f9f5a2d3dcc in /home/szding/.conda/envs/v2sa/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xd6df4 (0x7f9fa5d7adf4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #5: <unknown function> + 0x8609 (0x7f9fa8104609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x43 (0x7f9fa7ecf353 in /lib/x86_64-linux-gnu/libc.so.6)
Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f9f58ff4897 in /home/szding/.conda/envs/v2sa/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xe32119 (0x7f9f59f57119 in /home/szding/.conda/envs/v2sa/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0xd6df4 (0x7f9fa5d7adf4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #3: <unknown function> + 0x8609 (0x7f9fa8104609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #4: clone + 0x43 (0x7f9fa7ecf353 in /lib/x86_64-linux-gnu/libc.so.6)
[rank1]:[E ProcessGroupNCCL.cpp:563] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=10326, OpType=BROADCAST, NumelIn=67844042, NumelOut=67844042, Timeout(ms)=1800000) ran for 1800097 milliseconds before timing out.
[rank1]:[E ProcessGroupNCCL.cpp:1537] [PG 0 Rank 1] Timeout at NCCL work: 10326, last enqueued NCCL work: 10330, last completed NCCL work: 10325.
[rank1]:[E ProcessGroupNCCL.cpp:577] [Rank 1] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank1]:[E ProcessGroupNCCL.cpp:583] [Rank 1] To avoid data inconsistency, we are taking the entire process down.
[rank1]:[E ProcessGroupNCCL.cpp:1414] [PG 0 Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=10326, OpType=BROADCAST, NumelIn=67844042, NumelOut=67844042, Timeout(ms)=1800000) ran for 1800097 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f8abc588897 in /home/szding/.conda/envs/v2sa/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f8abd861c62 in /home/szding/.conda/envs/v2sa/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f8abd866a80 in /home/szding/.conda/envs/v2sa/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f8abd867dcc in /home/szding/.conda/envs/v2sa/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xd6df4 (0x7f8b0930edf4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #5: <unknown function> + 0x8609 (0x7f8b0b698609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x43 (0x7f8b0b463353 in /lib/x86_64-linux-gnu/libc.so.6)
terminate called after throwing an instance of 'c10::DistBackendError'
what(): [PG 0 Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=10326, OpType=BROADCAST, NumelIn=67844042, NumelOut=67844042, Timeout(ms)=1800000) ran for 1800097 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f8abc588897 in /home/szding/.conda/envs/v2sa/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f8abd861c62 in /home/szding/.conda/envs/v2sa/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f8abd866a80 in /home/szding/.conda/envs/v2sa/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f8abd867dcc in /home/szding/.conda/envs/v2sa/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xd6df4 (0x7f8b0930edf4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #5: <unknown function> + 0x8609 (0x7f8b0b698609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x43 (0x7f8b0b463353 in /lib/x86_64-linux-gnu/libc.so.6)
Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f8abc588897 in /home/szding/.conda/envs/v2sa/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xe32119 (0x7f8abd4eb119 in /home/szding/.conda/envs/v2sa/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0xd6df4 (0x7f8b0930edf4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #3: <unknown function> + 0x8609 (0x7f8b0b698609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #4: clone + 0x43 (0x7f8b0b463353 in /lib/x86_64-linux-gnu/libc.so.6)
[rank: 1] Child process with PID 233307 terminated with code -6. Forcefully terminating all other processes to avoid zombies 🧟
Killed
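The Timeout(ms)=1800000 in this second set of traces is the default 30-minute process-group timeout, so one thing I could try is raising it through Lightning's DDPStrategy (a minimal sketch, assuming a pytorch_lightning 2.x version where DDPStrategy accepts a timeout argument; untested):

```python
from datetime import timedelta

import pytorch_lightning as pl
from pytorch_lightning.strategies import DDPStrategy

# Give collectives 2 hours instead of the default 30 minutes before
# the watchdog declares them timed out.
trainer = pl.Trainer(
    devices=4,
    strategy=DDPStrategy(timeout=timedelta(hours=2)),
)
```

That said, the hang happens on a BROADCAST mid-training (epoch 55, SeqNum=10326), not at startup, so I suspect a longer timeout would only delay the failure. Has anyone seen one rank stall on a broadcast like this after many epochs?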