I'm running PyTorch DDP training on 2 nodes × 8 GPUs. At validation step 1440 (about 3.5 hours into the run), the job crashed with an NCCL collective timeout on an ALLGATHER operation.
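For context, the process group setup is nothing unusual; a simplified sketch is below (the model is a stand-in for my real network, not the actual code). The Timeout(ms)=1800000 in the error is just the default 30-minute NCCL timeout.

```python
import os
from datetime import timedelta

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Launched with torchrun across 2 nodes x 8 GPUs (world size 16).
dist.init_process_group(
    backend="nccl",
    timeout=timedelta(minutes=30),  # the default; matches Timeout(ms)=1800000 below
)
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

net = torch.nn.Linear(16, 16).cuda(local_rank)  # stand-in for my real model
model = DDP(net, device_ids=[local_rank])
```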
Error info
2025-01-22T15:47:36.977Z [rank3]:[E ProcessGroupNCCL.cpp:563] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=3472, OpType=ALLGATHER, NumelIn=1, NumelOut=16, Timeout(ms)=1800000) ran for 1800058 milliseconds before timing out.
2025-01-22T15:47:36.980Z [rank3]:[E ProcessGroupNCCL.cpp:1537] [PG 0 Rank 3] Timeout at NCCL work: 3472, last enqueued NCCL work: 3472, last completed NCCL work: 3471.
2025-01-22T15:47:36.980Z [rank3]:[E ProcessGroupNCCL.cpp:577] [Rank 3] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
2025-01-22T15:47:36.980Z [rank3]:[E ProcessGroupNCCL.cpp:583] [Rank 3] To avoid data inconsistency, we are taking the entire process down.
2025-01-22T15:47:36.980Z [rank3]:[E ProcessGroupNCCL.cpp:1414] [PG 0 Rank 3] Process group watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=3472, OpType=ALLGATHER, NumelIn=1, NumelOut=16, Timeout(ms)=1800000) ran for 1800058 milliseconds before timing out.
2025-01-22T15:47:36.980Z Exception raised from checkTimeout at /opt/pytorch/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
2025-01-22T15:47:36.980Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x99 (0x7fd57c798e89 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
2025-01-22T15:47:36.980Z frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1e1 (0x7fd5190ae121 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
2025-01-22T15:47:36.980Z frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7fd5190b54e0 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
2025-01-22T15:47:36.980Z frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10f (0x7fd5190b63ff in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
2025-01-22T15:47:36.980Z frame #4: <unknown function> + 0xdc253 (0x7fd629ab0253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
2025-01-22T15:47:36.980Z frame #5: <unknown function> + 0x94ac3 (0x7fd633b44ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
2025-01-22T15:47:36.980Z frame #6: clone + 0x44 (0x7fd633bd5a04 in /usr/lib/x86_64-linux-gnu/libc.so.6)
2025-01-22T15:47:36.981Z terminate called after throwing an instance of 'c10::DistBackendError'
2025-01-22T15:47:36.981Z what(): [PG 0 Rank 3] Process group watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=3472, OpType=ALLGATHER, NumelIn=1, NumelOut=16, Timeout(ms)=1800000) ran for 1800058 milliseconds before timing out.
2025-01-22T15:47:36.981Z Exception raised from checkTimeout at /opt/pytorch/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
2025-01-22T15:47:36.981Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x99 (0x7fd57c798e89 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
2025-01-22T15:47:36.981Z frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1e1 (0x7fd5190ae121 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
2025-01-22T15:47:36.981Z frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7fd5190b54e0 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
2025-01-22T15:47:36.981Z frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10f (0x7fd5190b63ff in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
2025-01-22T15:47:36.981Z frame #4: <unknown function> + 0xdc253 (0x7fd629ab0253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
2025-01-22T15:47:36.981Z frame #5: <unknown function> + 0x94ac3 (0x7fd633b44ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
2025-01-22T15:47:36.981Z frame #6: clone + 0x44 (0x7fd633bd5a04 in /usr/lib/x86_64-linux-gnu/libc.so.6)
2025-01-22T15:47:36.981Z Exception raised from ncclCommWatchdog at /opt/pytorch/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first):
2025-01-22T15:47:36.981Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x99 (0x7fd57c798e89 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
2025-01-22T15:47:36.981Z frame #1: <unknown function> + 0x103534e (0x7fd5190dd34e in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
2025-01-22T15:47:36.981Z frame #2: <unknown function> + 0xcb0e25 (0x7fd518d58e25 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
2025-01-22T15:47:36.981Z frame #3: <unknown function> + 0xdc253 (0x7fd629ab0253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
2025-01-22T15:47:36.981Z frame #4: <unknown function> + 0x94ac3 (0x7fd633b44ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
2025-01-22T15:47:36.981Z frame #5: clone + 0x44 (0x7fd633bd5a04 in /usr/lib/x86_64-linux-gnu/libc.so.6)
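For what it's worth, the failing collective has NumelIn=1 and NumelOut=16, which matches the per-rank scalar metric gather my validation loop does across the 16 ranks. Simplified (variable names are mine, not the exact code):

```python
import torch
import torch.distributed as dist

# Each rank contributes one scalar per validation step; with 16 ranks this
# gives NumelIn=1 / NumelOut=16, matching the ALLGATHER in the trace above.
val_metric = torch.tensor([0.0], device="cuda")  # placeholder value
gathered = [torch.empty_like(val_metric) for _ in range(dist.get_world_size())]
dist.all_gather(gathered, val_metric)  # hangs if any rank never reaches this call
```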
[Screenshots: CPU usage and memory usage; GPU usage and GPU memory usage]
I set NCCL_DEBUG=INFO, and the logs from the other ranks look normal. Only local rank 2 (on node cnwla-a800-p01072) shows "Connection closed" messages from the other local ranks, and that is the only extra information I get. Can anyone help? Thanks.
cnwla-a800-p01072:1267938:1278813 [2] NCCL INFO [Service thread] Connection closed by localRank 5
cnwla-a800-p01072:1267938:1278813 [2] NCCL INFO [Service thread] Connection closed by localRank 7
cnwla-a800-p01072:1267938:1278813 [2] NCCL INFO [Service thread] Connection closed by localRank 3
cnwla-a800-p01072:1267938:1278813 [2] NCCL INFO [Service thread] Connection closed by localRank 6
cnwla-a800-p01072:1267938:1278813 [2] NCCL INFO [Service thread] Connection closed by localRank 1
cnwla-a800-p01072:1267938:1278813 [2] NCCL INFO [Service thread] Connection closed by localRank 4
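One thing I'm considering for the next run is surfacing the failure at the hanging call itself rather than in the watchdog, and raising the timeout to rule out a merely slow rank. A sketch of what I mean (I haven't verified this isolates the root cause):

```python
import os
from datetime import timedelta

import torch.distributed as dist

# Must be set before init_process_group. On older PyTorch versions the
# variable is NCCL_BLOCKING_WAIT rather than TORCH_NCCL_BLOCKING_WAIT.
os.environ["TORCH_NCCL_BLOCKING_WAIT"] = "1"  # raise in the blocked call, not the watchdog

dist.init_process_group(
    backend="nccl",
    timeout=timedelta(hours=2),  # longer than the default 30 minutes
)
```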