Error: Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data

This error always occurs after I've been training the model on multiple GPUs for several minutes.

Here are the full logs:

[Training] [2024-02-04T09:38:04.500446] [rank0]:[E ProcessGroupNCCL.cpp:523] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=548, OpType=ALLREDUCE, NumelIn=8398850, NumelOut=8398850, Timeout(ms)=300000) ran for 614355 milliseconds before timing out.
[Training] [2024-02-04T09:38:04.861434] 067c91739610:40494:40523 [0] NCCL INFO [Service thread] Connection closed by localRank 0
[Training] [2024-02-04T09:38:04.879599] 067c91739610:40494:40506 [0] NCCL INFO comm 0x10d43a90 rank 0 nranks 4 cudaDev 0 busId 1000 - Abort COMPLETE
[Training] [2024-02-04T09:38:04.879639] [rank0]:[E ProcessGroupNCCL.cpp:537] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[Training] [2024-02-04T09:38:04.879647] [rank0]:[E ProcessGroupNCCL.cpp:543] To avoid data inconsistency, we are taking the entire process down.
[Training] [2024-02-04T09:38:04.879724] [rank0]:[E ProcessGroupNCCL.cpp:1182] [Rank 0] NCCL watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=548, OpType=ALLREDUCE, NumelIn=8398850, NumelOut=8398850, Timeout(ms)=300000) ran for 614355 milliseconds before timing out.
[Training] [2024-02-04T09:38:04.879733] Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:525 (most recent call first):
[Training] [2024-02-04T09:38:04.879739] frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fd4e3037d87 in /root/ai-voice-cloning/venv/lib/python3.10/site-packages/torch/lib/libc10.so)
[Training] [2024-02-04T09:38:04.879745] frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1e6 (0x7fd4e41b2f66 in /root/ai-voice-cloning/venv/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[Training] [2024-02-04T09:38:04.879751] frame #2: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x19d (0x7fd4e41b64bd in /root/ai-voice-cloning/venv/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[Training] [2024-02-04T09:38:04.879756] frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x119 (0x7fd4e41b70b9 in /root/ai-voice-cloning/venv/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[Training] [2024-02-04T09:38:04.879782] frame #4: <unknown function> + 0xd6df4 (0x7fd531c38df4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
[Training] [2024-02-04T09:38:04.879789] frame #5: <unknown function> + 0x8609 (0x7fd532d47609 in /lib/x86_64-linux-gnu/libpthread.so.0)
[Training] [2024-02-04T09:38:04.879794] frame #6: clone + 0x43 (0x7fd532e81353 in /lib/x86_64-linux-gnu/libc.so.6)
[Training] [2024-02-04T09:38:04.879799] 
[Training] [2024-02-04T09:38:04.880229] terminate called after throwing an instance of 'c10::DistBackendError'
[Training] [2024-02-04T09:38:04.880240]   what():  [Rank 0] NCCL watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=548, OpType=ALLREDUCE, NumelIn=8398850, NumelOut=8398850, Timeout(ms)=300000) ran for 614355 milliseconds before timing out.
[Training] [2024-02-04T09:38:04.880246] Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:525 (most recent call first):
[Training] [2024-02-04T09:38:04.880251] frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fd4e3037d87 in /root/ai-voice-cloning/venv/lib/python3.10/site-packages/torch/lib/libc10.so)
[Training] [2024-02-04T09:38:04.880256] frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1e6 (0x7fd4e41b2f66 in /root/ai-voice-cloning/venv/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[Training] [2024-02-04T09:38:04.880261] frame #2: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x19d (0x7fd4e41b64bd in /root/ai-voice-cloning/venv/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[Training] [2024-02-04T09:38:04.880266] frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x119 (0x7fd4e41b70b9 in /root/ai-voice-cloning/venv/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[Training] [2024-02-04T09:38:04.880271] frame #4: <unknown function> + 0xd6df4 (0x7fd531c38df4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
[Training] [2024-02-04T09:38:04.880276] frame #5: <unknown function> + 0x8609 (0x7fd532d47609 in /lib/x86_64-linux-gnu/libpthread.so.0)
[Training] [2024-02-04T09:38:04.880281] frame #6: clone + 0x43 (0x7fd532e81353 in /lib/x86_64-linux-gnu/libc.so.6)
[Training] [2024-02-04T09:38:04.880285] 
[Training] [2024-02-04T09:38:04.880290] Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1186 (most recent call first):
[Training] [2024-02-04T09:38:04.880296] frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fd4e3037d87 in /root/ai-voice-cloning/venv/lib/python3.10/site-packages/torch/lib/libc10.so)
[Training] [2024-02-04T09:38:04.880301] frame #1: <unknown function> + 0xdcc083 (0x7fd4e3f0f083 in /root/ai-voice-cloning/venv/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[Training] [2024-02-04T09:38:04.880306] frame #2: <unknown function> + 0xd6df4 (0x7fd531c38df4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
[Training] [2024-02-04T09:38:04.880311] frame #3: <unknown function> + 0x8609 (0x7fd532d47609 in /lib/x86_64-linux-gnu/libpthread.so.0)
[Training] [2024-02-04T09:38:04.880316] frame #4: clone + 0x43 (0x7fd532e81353 in /lib/x86_64-linux-gnu/libc.so.6)
[Training] [2024-02-04T09:38:04.880320] 
[Training] [2024-02-04T09:38:06.248993] 067c91739610:40495:40522 [1] NCCL INFO [Service thread] Connection closed by localRank 1
[Training] [2024-02-04T09:38:06.269583] 067c91739610:40495:40505 [0] NCCL INFO comm 0x142f8350 rank 1 nranks 4 cudaDev 1 busId 81000 - Abort COMPLETE
[Training] [2024-02-04T09:38:06.269674] [rank1]:[E ProcessGroupNCCL.cpp:537] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[Training] [2024-02-04T09:38:06.269685] [rank1]:[E ProcessGroupNCCL.cpp:543] To avoid data inconsistency, we are taking the entire process down.
[Training] [2024-02-04T09:38:06.269692] [rank1]:[E ProcessGroupNCCL.cpp:1182] [Rank 1] NCCL watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=548, OpType=ALLREDUCE, NumelIn=8398850, NumelOut=8398850, Timeout(ms)=300000) ran for 300126 milliseconds before timing out.
[Training] [2024-02-04T09:38:06.269699] Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:525 (most recent call first):
[Training] [2024-02-04T09:38:06.269705] frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7effc877fd87 in /root/ai-voice-cloning/venv/lib/python3.10/site-packages/torch/lib/libc10.so)
[Training] [2024-02-04T09:38:06.269711] frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1e6 (0x7effc98faf66 in /root/ai-voice-cloning/venv/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[Training] [2024-02-04T09:38:06.269716] frame #2: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x19d (0x7effc98fe4bd in /root/ai-voice-cloning/venv/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[Training] [2024-02-04T09:38:06.269722] frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x119 (0x7effc98ff0b9 in /root/ai-voice-cloning/venv/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[Training] [2024-02-04T09:38:06.269729] frame #4: <unknown function> + 0xd6df4 (0x7f0017380df4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
[Training] [2024-02-04T09:38:06.269734] frame #5: <unknown function> + 0x8609 (0x7f001848f609 in /lib/x86_64-linux-gnu/libpthread.so.0)
[Training] [2024-02-04T09:38:06.269739] frame #6: clone + 0x43 (0x7f00185c9353 in /lib/x86_64-linux-gnu/libc.so.6)
[Training] [2024-02-04T09:38:06.269744] 
[Training] [2024-02-04T09:38:06.269750] terminate called after throwing an instance of 'c10::DistBackendError'
[Training] [2024-02-04T09:38:06.269756]   what():  [Rank 1] NCCL watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=548, OpType=ALLREDUCE, NumelIn=8398850, NumelOut=8398850, Timeout(ms)=300000) ran for 300126 milliseconds before timing out.
[Training] [2024-02-04T09:38:06.269762] Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:525 (most recent call first):
[Training] [2024-02-04T09:38:06.269768] frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7effc877fd87 in /root/ai-voice-cloning/venv/lib/python3.10/site-packages/torch/lib/libc10.so)
[Training] [2024-02-04T09:38:06.269773] frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1e6 (0x7effc98faf66 in /root/ai-voice-cloning/venv/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[Training] [2024-02-04T09:38:06.269778] frame #2: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x19d (0x7effc98fe4bd in /root/ai-voice-cloning/venv/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[Training] [2024-02-04T09:38:06.269783] frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x119 (0x7effc98ff0b9 in /root/ai-voice-cloning/venv/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[Training] [2024-02-04T09:38:06.269788] frame #4: <unknown function> + 0xd6df4 (0x7f0017380df4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
[Training] [2024-02-04T09:38:06.269793] frame #5: <unknown function> + 0x8609 (0x7f001848f609 in /lib/x86_64-linux-gnu/libpthread.so.0)
[Training] [2024-02-04T09:38:06.269798] frame #6: clone + 0x43 (0x7f00185c9353 in /lib/x86_64-linux-gnu/libc.so.6)
[Training] [2024-02-04T09:38:06.269803] 
[Training] [2024-02-04T09:38:06.269808] Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1186 (most recent call first):
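From the trace, rank 0's ALLREDUCE (SeqNum=548) ran for 614355 ms against a 300000 ms (5 minute) watchdog timeout before NCCL aborted all ranks. One thing I haven't tried yet is raising that timeout when the process group is created. Here's a minimal sketch of what I mean; the `init_process_group` call is hypothetical, since I haven't confirmed where this repo actually sets up distributed training:

```python
import datetime

import torch.distributed as dist

# Hypothetical setup sketch -- I haven't verified where ai-voice-cloning
# initializes its process group. The log shows Timeout(ms)=300000, so the
# NCCL watchdog currently aborts any collective that takes longer than
# 5 minutes; the timeout kwarg below raises that limit.
dist.init_process_group(
    backend="nccl",
    timeout=datetime.timedelta(minutes=30),  # give slow ranks more headroom
)
```

That said, I'm not sure whether bumping the timeout is reasonable here or whether an allreduce taking over 10 minutes points to a deeper problem (e.g., one rank hanging in data loading). I can also rerun with `NCCL_DEBUG=INFO` if more detail would help.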

Thank you in advance.