[rank4]:[E ProcessGroupNCCL.cpp:523] [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=466, OpType=_ALLGATHER_BASE, NumelIn=25297920, NumelOut=202383360, Timeout(ms)=600000) ran for 1054176 milliseconds before timing out.
[rank7]:[E ProcessGroupNCCL.cpp:523] [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=466, OpType=_ALLGATHER_BASE, NumelIn=25297920, NumelOut=202383360, Timeout(ms)=600000) ran for 1054176 milliseconds before timing out.
[rank5]:[E ProcessGroupNCCL.cpp:523] [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=466, OpType=_ALLGATHER_BASE, NumelIn=25297920, NumelOut=202383360, Timeout(ms)=600000) ran for 1054176 milliseconds before timing out.
[rank1]:[E ProcessGroupNCCL.cpp:523] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=466, OpType=_ALLGATHER_BASE, NumelIn=25297920, NumelOut=202383360, Timeout(ms)=600000) ran for 1054175 milliseconds before timing out.
[rank3]:[E ProcessGroupNCCL.cpp:523] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=466, OpType=_ALLGATHER_BASE, NumelIn=25297920, NumelOut=202383360, Timeout(ms)=600000) ran for 1054176 milliseconds before timing out.
[rank6]:[E ProcessGroupNCCL.cpp:523] [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=466, OpType=_ALLGATHER_BASE, NumelIn=25297920, NumelOut=202383360, Timeout(ms)=600000) ran for 1054176 milliseconds before timing out.
[rank2]:[E ProcessGroupNCCL.cpp:523] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=466, OpType=_ALLGATHER_BASE, NumelIn=25297920, NumelOut=202383360, Timeout(ms)=600000) ran for 1054176 milliseconds before timing out.
[rank3]:[E ProcessGroupNCCL.cpp:537] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank3]:[E ProcessGroupNCCL.cpp:543] To avoid data inconsistency, we are taking the entire process down.
[rank3]:[E ProcessGroupNCCL.cpp:1182] [Rank 3] NCCL watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=466, OpType=_ALLGATHER_BASE, NumelIn=25297920, NumelOut=202383360, Timeout(ms)=600000) ran for 1054176 milliseconds before timing out.
Exception raised from checkTimeout at …/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:525 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7feee5debd87 in /opt/conda/envs/develop/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1e6 (0x7feee6f936e6 in /opt/conda/envs/develop/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x19d (0x7feee6f96c3d in /opt/conda/envs/develop/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x119 (0x7feee6f97839 in /opt/conda/envs/develop/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0xd3e95 (0x7fef30cb6e95 in /opt/conda/envs/develop/bin/…/lib/libstdc++.so.6)
frame #5: + 0x8609 (0x7fef31fba609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x43 (0x7fef31d7b353 in /usr/lib/x86_64-linux-gnu/libc.so.6)
terminate called after throwing an instance of 'c10::DistBackendError'
what(): [Rank 3] NCCL watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=466, OpType=_ALLGATHER_BASE, NumelIn=25297920, NumelOut=202383360, Timeout(ms)=600000) ran for 1054176 milliseconds before timing out.
Exception raised from checkTimeout at …/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:525 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7feee5debd87 in /opt/conda/envs/develop/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1e6 (0x7feee6f936e6 in /opt/conda/envs/develop/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x19d (0x7feee6f96c3d in /opt/conda/envs/develop/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x119 (0x7feee6f97839 in /opt/conda/envs/develop/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0xd3e95 (0x7fef30cb6e95 in /opt/conda/envs/develop/bin/…/lib/libstdc++.so.6)
frame #5: + 0x8609 (0x7fef31fba609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x43 (0x7fef31d7b353 in /usr/lib/x86_64-linux-gnu/libc.so.6)
Exception raised from ncclCommWatchdog at …/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1186 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7feee5debd87 in /opt/conda/envs/develop/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: + 0xdf6b11 (0x7feee6cedb11 in /opt/conda/envs/develop/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: + 0xd3e95 (0x7fef30cb6e95 in /opt/conda/envs/develop/bin/…/lib/libstdc++.so.6)
frame #3: + 0x8609 (0x7fef31fba609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0)
frame #4: clone + 0x43 (0x7fef31d7b353 in /usr/lib/x86_64-linux-gnu/libc.so.6)
/opt/conda/envs/develop/lib/python3.10/site-packages/torch/utils/checkpoint.py:460: UserWarning: torch.utils.checkpoint: please pass in use_reentrant=True or use_reentrant=False explicitly. The default value of use_reentrant will be updated to be False in the future. To maintain current behavior, pass use_reentrant=True. It is recommended that you use use_reentrant=False. Refer to docs for more details on the differences between the two variants.
warnings.warn(
[rank4]:[E ProcessGroupNCCL.cpp:537] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank4]:[E ProcessGroupNCCL.cpp:543] To avoid data inconsistency, we are taking the entire process down.
[rank4]:[E ProcessGroupNCCL.cpp:1182] [Rank 4] NCCL watchdog thread terminated with exception: [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=466, OpType=_ALLGATHER_BASE, NumelIn=25297920, NumelOut=202383360, Timeout(ms)=600000) ran for 1054176 milliseconds before timing out.
Exception raised from checkTimeout at …/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:525 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fa83b8b8d87 in /opt/conda/envs/develop/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1e6 (0x7fa83ca606e6 in /opt/conda/envs/develop/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x19d (0x7fa83ca63c3d in /opt/conda/envs/develop/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x119 (0x7fa83ca64839 in /opt/conda/envs/develop/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0xd3e95 (0x7fa886783e95 in /opt/conda/envs/develop/bin/…/lib/libstdc++.so.6)
frame #5: + 0x8609 (0x7fa887a87609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x43 (0x7fa887848353 in /usr/lib/x86_64-linux-gnu/libc.so.6)
terminate called after throwing an instance of 'c10::DistBackendError'
what(): [Rank 4] NCCL watchdog thread terminated with exception: [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=466, OpType=_ALLGATHER_BASE, NumelIn=25297920, NumelOut=202383360, Timeout(ms)=600000) ran for 1054176 milliseconds before timing out.
Exception raised from checkTimeout at …/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:525 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fa83b8b8d87 in /opt/conda/envs/develop/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1e6 (0x7fa83ca606e6 in /opt/conda/envs/develop/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x19d (0x7fa83ca63c3d in /opt/conda/envs/develop/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x119 (0x7fa83ca64839 in /opt/conda/envs/develop/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0xd3e95 (0x7fa886783e95 in /opt/conda/envs/develop/bin/…/lib/libstdc++.so.6)
frame #5: + 0x8609 (0x7fa887a87609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x43 (0x7fa887848353 in /usr/lib/x86_64-linux-gnu/libc.so.6)
Exception raised from ncclCommWatchdog at …/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1186 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fa83b8b8d87 in /opt/conda/envs/develop/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: + 0xdf6b11 (0x7fa83c7bab11 in /opt/conda/envs/develop/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: + 0xd3e95 (0x7fa886783e95 in /opt/conda/envs/develop/bin/…/lib/libstdc++.so.6)
frame #3: + 0x8609 (0x7fa887a87609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0)
frame #4: clone + 0x43 (0x7fa887848353 in /usr/lib/x86_64-linux-gnu/libc.so.6)
[rank5]:[E ProcessGroupNCCL.cpp:537] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank5]:[E ProcessGroupNCCL.cpp:543] To avoid data inconsistency, we are taking the entire process down.
[rank5]:[E ProcessGroupNCCL.cpp:1182] [Rank 5] NCCL watchdog thread terminated with exception: [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=466, OpType=_ALLGATHER_BASE, NumelIn=25297920, NumelOut=202383360, Timeout(ms)=600000) ran for 1054176 milliseconds before timing out.
Exception raised from checkTimeout at …/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:525 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f2f75160d87 in /opt/conda/envs/develop/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1e6 (0x7f2f763086e6 in /opt/conda/envs/develop/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x19d (0x7f2f7630bc3d in /opt/conda/envs/develop/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x119 (0x7f2f7630c839 in /opt/conda/envs/develop/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0xd3e95 (0x7f2fc002be95 in /opt/conda/envs/develop/bin/…/lib/libstdc++.so.6)
frame #5: + 0x8609 (0x7f2fc132f609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x43 (0x7f2fc10f0353 in /usr/lib/x86_64-linux-gnu/libc.so.6)
terminate called after throwing an instance of 'c10::DistBackendError'
what(): [Rank 5] NCCL watchdog thread terminated with exception: [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=466, OpType=_ALLGATHER_BASE, NumelIn=25297920, NumelOut=202383360, Timeout(ms)=600000) ran for 1054176 milliseconds before timing out.
Exception raised from checkTimeout at …/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:525 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f2f75160d87 in /opt/conda/envs/develop/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1e6 (0x7f2f763086e6 in /opt/conda/envs/develop/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x19d (0x7f2f7630bc3d in /opt/conda/envs/develop/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x119 (0x7f2f7630c839 in /opt/conda/envs/develop/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0xd3e95 (0x7f2fc002be95 in /opt/conda/envs/develop/bin/…/lib/libstdc++.so.6)
frame #5: + 0x8609 (0x7f2fc132f609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x43 (0x7f2fc10f0353 in /usr/lib/x86_64-linux-gnu/libc.so.6)
Exception raised from ncclCommWatchdog at …/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1186 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f2f75160d87 in /opt/conda/envs/develop/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: + 0xdf6b11 (0x7f2f76062b11 in /opt/conda/envs/develop/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: + 0xd3e95 (0x7f2fc002be95 in /opt/conda/envs/develop/bin/…/lib/libstdc++.so.6)
frame #3: + 0x8609 (0x7f2fc132f609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0)
frame #4: clone + 0x43 (0x7f2fc10f0353 in /usr/lib/x86_64-linux-gnu/libc.so.6)
[rank6]:[E ProcessGroupNCCL.cpp:537] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank6]:[E ProcessGroupNCCL.cpp:543] To avoid data inconsistency, we are taking the entire process down.
[rank6]:[E ProcessGroupNCCL.cpp:1182] [Rank 6] NCCL watchdog thread terminated with exception: [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=466, OpType=_ALLGATHER_BASE, NumelIn=25297920, NumelOut=202383360, Timeout(ms)=600000) ran for 1054176 milliseconds before timing out.
Exception raised from checkTimeout at …/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:525 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fd7da026d87 in /opt/conda/envs/develop/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1e6 (0x7fd7db1ce6e6 in /opt/conda/envs/develop/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x19d (0x7fd7db1d1c3d in /opt/conda/envs/develop/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x119 (0x7fd7db1d2839 in /opt/conda/envs/develop/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0xd3e95 (0x7fd824ef1e95 in /opt/conda/envs/develop/bin/…/lib/libstdc++.so.6)
frame #5: + 0x8609 (0x7fd8261f5609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x43 (0x7fd825fb6353 in /usr/lib/x86_64-linux-gnu/libc.so.6)
terminate called after throwing an instance of 'c10::DistBackendError'
what(): [Rank 6] NCCL watchdog thread terminated with exception: [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=466, OpType=_ALLGATHER_BASE, NumelIn=25297920, NumelOut=202383360, Timeout(ms)=600000) ran for 1054176 milliseconds before timing out.
Exception raised from checkTimeout at …/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:525 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fd7da026d87 in /opt/conda/envs/develop/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1e6 (0x7fd7db1ce6e6 in /opt/conda/envs/develop/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x19d (0x7fd7db1d1c3d in /opt/conda/envs/develop/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x119 (0x7fd7db1d2839 in /opt/conda/envs/develop/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0xd3e95 (0x7fd824ef1e95 in /opt/conda/envs/develop/bin/…/lib/libstdc++.so.6)
frame #5: + 0x8609 (0x7fd8261f5609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x43 (0x7fd825fb6353 in /usr/lib/x86_64-linux-gnu/libc.so.6)
Exception raised from ncclCommWatchdog at …/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1186 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fd7da026d87 in /opt/conda/envs/develop/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: + 0xdf6b11 (0x7fd7daf28b11 in /opt/conda/envs/develop/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: + 0xd3e95 (0x7fd824ef1e95 in /opt/conda/envs/develop/bin/…/lib/libstdc++.so.6)
frame #3: + 0x8609 (0x7fd8261f5609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0)
frame #4: clone + 0x43 (0x7fd825fb6353 in /usr/lib/x86_64-linux-gnu/libc.so.6)
[rank2]:[E ProcessGroupNCCL.cpp:537] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank2]:[E ProcessGroupNCCL.cpp:543] To avoid data inconsistency, we are taking the entire process down.
[rank2]:[E ProcessGroupNCCL.cpp:1182] [Rank 2] NCCL watchdog thread terminated with exception: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=466, OpType=_ALLGATHER_BASE, NumelIn=25297920, NumelOut=202383360, Timeout(ms)=600000) ran for 1054176 milliseconds before timing out.
Exception raised from checkTimeout at …/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:525 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f2bbdda1d87 in /opt/conda/envs/develop/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1e6 (0x7f2bbef496e6 in /opt/conda/envs/develop/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x19d (0x7f2bbef4cc3d in /opt/conda/envs/develop/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x119 (0x7f2bbef4d839 in /opt/conda/envs/develop/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0xd3e95 (0x7f2c08c6ce95 in /opt/conda/envs/develop/bin/…/lib/libstdc++.so.6)
frame #5: + 0x8609 (0x7f2c09f70609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x43 (0x7f2c09d31353 in /usr/lib/x86_64-linux-gnu/libc.so.6)
terminate called after throwing an instance of 'c10::DistBackendError'
what(): [Rank 2] NCCL watchdog thread terminated with exception: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=466, OpType=_ALLGATHER_BASE, NumelIn=25297920, NumelOut=202383360, Timeout(ms)=600000) ran for 1054176 milliseconds before timing out.
Exception raised from checkTimeout at …/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:525 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f2bbdda1d87 in /opt/conda/envs/develop/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1e6 (0x7f2bbef496e6 in /opt/conda/envs/develop/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x19d (0x7f2bbef4cc3d in /opt/conda/envs/develop/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x119 (0x7f2bbef4d839 in /opt/conda/envs/develop/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0xd3e95 (0x7f2c08c6ce95 in /opt/conda/envs/develop/bin/…/lib/libstdc++.so.6)
frame #5: + 0x8609 (0x7f2c09f70609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x43 (0x7f2c09d31353 in /usr/lib/x86_64-linux-gnu/libc.so.6)
Exception raised from ncclCommWatchdog at …/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1186 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f2bbdda1d87 in /opt/conda/envs/develop/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: + 0xdf6b11 (0x7f2bbeca3b11 in /opt/conda/envs/develop/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: + 0xd3e95 (0x7f2c08c6ce95 in /opt/conda/envs/develop/bin/…/lib/libstdc++.so.6)
frame #3: + 0x8609 (0x7f2c09f70609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0)
frame #4: clone + 0x43 (0x7f2c09d31353 in /usr/lib/x86_64-linux-gnu/libc.so.6)
[2024-03-06 16:30:46,076] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 2056 closing signal SIGTERM
[2024-03-06 16:30:46,077] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 2057 closing signal SIGTERM
[2024-03-06 16:30:46,078] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 2063 closing signal SIGTERM
[2024-03-06 16:30:47,057] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: -6) local_rank: 2 (pid: 2058) of binary: /opt/conda/envs/develop/bin/python3.10
Traceback (most recent call last):
  File "/opt/conda/envs/develop/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/opt/conda/envs/develop/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 47, in main
    args.func(args)
  File "/opt/conda/envs/develop/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1010, in launch_command
    multi_gpu_launcher(args)
  File "/opt/conda/envs/develop/lib/python3.10/site-packages/accelerate/commands/launch.py", line 672, in multi_gpu_launcher
    distrib_run.run(args)
  File "/opt/conda/envs/develop/lib/python3.10/site-packages/torch/distributed/run.py", line 803, in run
    elastic_launch(
  File "/opt/conda/envs/develop/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 135, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/envs/develop/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
mllm/pipeline/finetune.py FAILED
This is the error I keep hitting: the NCCL watchdog on every rank shown reports that the _ALLGATHER_BASE collective (SeqNum=466) timed out after its 600000 ms limit (it ran for about 1054176 ms), each watchdog thread then terminates with c10::DistBackendError, and torchelastic tears down the whole mllm/pipeline/finetune.py run launched through accelerate.
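For reference, the collective timeout shown in the log (Timeout(ms)=600000, i.e. 10 minutes) is fixed when the process group is created, so it can be raised if the all_gather at SeqNum=466 is merely slow rather than permanently stuck on one rank. Below is a minimal, hypothetical sketch of how that could look if the training script builds an accelerate Accelerator; I have not verified how mllm/pipeline/finetune.py actually initializes its process group, and the 2-hour value is only an example.

from datetime import timedelta

from accelerate import Accelerator
from accelerate.utils import InitProcessGroupKwargs

# Hypothetical change, not taken from mllm/pipeline/finetune.py:
# raise the NCCL collective timeout from the 600 s seen in the log to 2 hours.
# This only helps if the collective is genuinely slow; a rank that never
# reaches the all_gather will still hang, just for longer, before the
# watchdog fires.
pg_kwargs = InitProcessGroupKwargs(timeout=timedelta(hours=2))
accelerator = Accelerator(kwargs_handlers=[pg_kwargs])

# The torch.utils.checkpoint UserWarning in the log is a separate deprecation
# notice; it goes away once use_reentrant is passed explicitly, e.g.
#   torch.utils.checkpoint.checkpoint(block, hidden_states, use_reentrant=False)
# (block and hidden_states are placeholder names here).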