I am running my model on 5 A100 GPUs. After about 2.5 hours (120 epochs), the process terminates with an NCCL collective timeout error.
NOTE: I have run it many times, and it fails after 120 epochs every time.
PyTorch version: 2.2.2
CUDA: 12.1
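For reference, the Timeout(ms)=600000 in the watchdog message below is a 10-minute process-group timeout, and the job is launched through accelerate (see the traceback at the end). The following is only a minimal sketch of how that timeout could be raised via Accelerate's InitProcessGroupKwargs, not my actual training script; the 2-hour value is illustrative.

# Minimal sketch (hypothetical, not my training script): raise the process-group
# timeout that Accelerate passes to torch.distributed.init_process_group.
from datetime import timedelta

from accelerate import Accelerator
from accelerate.utils import InitProcessGroupKwargs

# The log shows Timeout(ms)=600000, i.e. 10 minutes; a longer timeout only
# delays the watchdog, it does not explain why the allreduce hangs.
pg_kwargs = InitProcessGroupKwargs(timeout=timedelta(hours=2))  # illustrative value
accelerator = Accelerator(kwargs_handlers=[pg_kwargs])

The full log from the failing run follows.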
[rank0]:[E ProcessGroupNCCL.cpp:523] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=58327, OpType=ALLREDUCE, NumelIn=288093632, NumelOut=288093632, Timeout(ms)=600000) ran for 600591 milliseconds before timing out.
ip-172-31-85-171:470266:470927 [1] NCCL INFO [Service thread] Connection closed by localRank 0
ip-172-31-85-171:470269:470928 [4] NCCL INFO [Service thread] Connection closed by localRank 0
ip-172-31-85-171:470265:470930 [0] NCCL INFO [Service thread] Connection closed by localRank 0
[rank1]:[E ProcessGroupNCCL.cpp:523] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=58327, OpType=ALLREDUCE, NumelIn=288093632, NumelOut=288093632, Timeout(ms)=600000) ran for 600975 milliseconds before timing out.
ip-172-31-85-171:470265:470930 [0] NCCL INFO [Service thread] Connection closed by localRank 1
ip-172-31-85-171:470266:470927 [1] NCCL INFO [Service thread] Connection closed by localRank 1
ip-172-31-85-171:470267:470926 [2] NCCL INFO [Service thread] Connection closed by localRank 1
ip-172-31-85-171:470265:470918 [0] NCCL INFO comm 0x5f5e8e5f4c00 rank 0 nranks 5 cudaDev 0 busId 101c0 - Abort COMPLETE
[rank0]:[E ProcessGroupNCCL.cpp:537] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank0]:[E ProcessGroupNCCL.cpp:543] To avoid data inconsistency, we are taking the entire process down.
[rank0]:[E ProcessGroupNCCL.cpp:1182] [Rank 0] NCCL watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=58327, OpType=ALLREDUCE, NumelIn=288093632, NumelOut=288093632, Timeout(ms)=600000) ran for 600591 milliseconds before timing out.
Exception raised from checkTimeout at /opt/conda/conda-bld/pytorch_1711403380909/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:525 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7d47dc180d87 in /opt/conda/envs/brain/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1e6 (0x7d47900c04d6 in /opt/conda/envs/brain/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x19d (0x7d47900c3a2d in /opt/conda/envs/brain/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x119 (0x7d47900c4629 in /opt/conda/envs/brain/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xd3b65 (0x7d47dc6f1b65 in /opt/conda/envs/brain/lib/python3.10/site-packages/torch/lib/../../../.././libstdc++.so.6)
frame #5: <unknown function> + 0x94ac3 (0x7d47e4094ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: <unknown function> + 0x126850 (0x7d47e4126850 in /lib/x86_64-linux-gnu/libc.so.6)
terminate called after throwing an instance of 'c10::DistBackendError'
what(): [Rank 0] NCCL watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=58327, OpType=ALLREDUCE, NumelIn=288093632, NumelOut=288093632, Timeout(ms)=600000) ran for 600591 milliseconds before timing out.
Exception raised from checkTimeout at /opt/conda/conda-bld/pytorch_1711403380909/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:525 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7d47dc180d87 in /opt/conda/envs/brain/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1e6 (0x7d47900c04d6 in /opt/conda/envs/brain/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x19d (0x7d47900c3a2d in /opt/conda/envs/brain/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x119 (0x7d47900c4629 in /opt/conda/envs/brain/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xd3b65 (0x7d47dc6f1b65 in /opt/conda/envs/brain/lib/python3.10/site-packages/torch/lib/../../../.././libstdc++.so.6)
frame #5: <unknown function> + 0x94ac3 (0x7d47e4094ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: <unknown function> + 0x126850 (0x7d47e4126850 in /lib/x86_64-linux-gnu/libc.so.6)
Exception raised from ncclCommWatchdog at /opt/conda/conda-bld/pytorch_1711403380909/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1186 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7d47dc180d87 in /opt/conda/envs/brain/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xe191e1 (0x7d478fe191e1 in /opt/conda/envs/brain/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0xd3b65 (0x7d47dc6f1b65 in /opt/conda/envs/brain/lib/python3.10/site-packages/torch/lib/../../../.././libstdc++.so.6)
frame #3: <unknown function> + 0x94ac3 (0x7d47e4094ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #4: <unknown function> + 0x126850 (0x7d47e4126850 in /lib/x86_64-linux-gnu/libc.so.6)
ip-172-31-85-171:470266:470914 [1] NCCL INFO comm 0x558bf205e9f0 rank 1 nranks 5 cudaDev 1 busId 101d0 - Abort COMPLETE
[rank1]:[E ProcessGroupNCCL.cpp:537] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank1]:[E ProcessGroupNCCL.cpp:543] To avoid data inconsistency, we are taking the entire process down.
[rank1]:[E ProcessGroupNCCL.cpp:1182] [Rank 1] NCCL watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=58327, OpType=ALLREDUCE, NumelIn=288093632, NumelOut=288093632, Timeout(ms)=600000) ran for 600975 milliseconds before timing out.
Exception raised from checkTimeout at /opt/conda/conda-bld/pytorch_1711403380909/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:525 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x71a46ff9ed87 in /opt/conda/envs/brain/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1e6 (0x71a423ac04d6 in /opt/conda/envs/brain/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x19d (0x71a423ac3a2d in /opt/conda/envs/brain/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x119 (0x71a423ac4629 in /opt/conda/envs/brain/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xd3b65 (0x71a4700f1b65 in /opt/conda/envs/brain/lib/python3.10/site-packages/torch/lib/../../../.././libstdc++.so.6)
frame #5: <unknown function> + 0x94ac3 (0x71a477c94ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: <unknown function> + 0x126850 (0x71a477d26850 in /lib/x86_64-linux-gnu/libc.so.6)
terminate called after throwing an instance of 'c10::DistBackendError'
what(): [Rank 1] NCCL watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=58327, OpType=ALLREDUCE, NumelIn=288093632, NumelOut=288093632, Timeout(ms)=600000) ran for 600975 milliseconds before timing out.
Exception raised from checkTimeout at /opt/conda/conda-bld/pytorch_1711403380909/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:525 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x71a46ff9ed87 in /opt/conda/envs/brain/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1e6 (0x71a423ac04d6 in /opt/conda/envs/brain/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x19d (0x71a423ac3a2d in /opt/conda/envs/brain/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x119 (0x71a423ac4629 in /opt/conda/envs/brain/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xd3b65 (0x71a4700f1b65 in /opt/conda/envs/brain/lib/python3.10/site-packages/torch/lib/../../../.././libstdc++.so.6)
frame #5: <unknown function> + 0x94ac3 (0x71a477c94ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: <unknown function> + 0x126850 (0x71a477d26850 in /lib/x86_64-linux-gnu/libc.so.6)
Exception raised from ncclCommWatchdog at /opt/conda/conda-bld/pytorch_1711403380909/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1186 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x71a46ff9ed87 in /opt/conda/envs/brain/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xe191e1 (0x71a4238191e1 in /opt/conda/envs/brain/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0xd3b65 (0x71a4700f1b65 in /opt/conda/envs/brain/lib/python3.10/site-packages/torch/lib/../../../.././libstdc++.so.6)
frame #3: <unknown function> + 0x94ac3 (0x71a477c94ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #4: <unknown function> + 0x126850 (0x71a477d26850 in /lib/x86_64-linux-gnu/libc.so.6)
[2025-01-16 22:20:42,170] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 470266 closing signal SIGTERM
[2025-01-16 22:20:42,172] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 470267 closing signal SIGTERM
[2025-01-16 22:20:42,172] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 470268 closing signal SIGTERM
[2025-01-16 22:20:42,173] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 470269 closing signal SIGTERM
[2025-01-16 22:20:44,840] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: -6) local_rank: 0 (pid: 470265) of binary: /opt/conda/envs/brain/bin/python3.10
Traceback (most recent call last):
  File "/opt/conda/envs/brain/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/opt/conda/envs/brain/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 48, in main
    args.func(args)
  File "/opt/conda/envs/brain/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1165, in launch_command
    multi_gpu_launcher(args)
  File "/opt/conda/envs/brain/lib/python3.10/site-packages/accelerate/commands/launch.py", line 799, in multi_gpu_launcher
    distrib_run.run(args)
  File "/opt/conda/envs/brain/lib/python3.10/site-packages/torch/distributed/run.py", line 803, in run
    elastic_launch(
  File "/opt/conda/envs/brain/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 135, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/envs/brain/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: