I’m hitting the following timeout errors frequently. Is there a way to raise the timeout from 30 minutes to infinity (or some very large value) so that I can inspect the details?
[1,158]<stderr>:[E ProcessGroupNCCL.cpp:566] [Rank 158] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1803931 milliseconds before timing out.
[1,105]<stderr>:[E ProcessGroupNCCL.cpp:566] [Rank 105] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1804038 milliseconds before timing out.
[1,110]<stderr>:[E ProcessGroupNCCL.cpp:566] [Rank 110] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1804067 milliseconds before timing out.
[1,207]<stderr>:[E ProcessGroupNCCL.cpp:566] [Rank 207] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1804115 milliseconds before timing out.
[1,248]<stderr>:[E ProcessGroupNCCL.cpp:566] [Rank 248] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1804127 milliseconds before timing out.
[1,202]<stderr>:[E ProcessGroupNCCL.cpp:566] [Rank 202] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1804206 milliseconds before timing out.
[1,106]<stderr>:[E ProcessGroupNCCL.cpp:566] [Rank 106] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1804245 milliseconds before timing out.
[1,104]<stderr>:[E ProcessGroupNCCL.cpp:566] [Rank 104] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1804245 milliseconds before timing out.
[1,251]<stderr>:[E ProcessGroupNCCL.cpp:566] [Rank 251] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1804274 milliseconds before timing out.
[1,205]<stderr>:[E ProcessGroupNCCL.cpp:566] [Rank 205] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1804314 milliseconds before timing out.
[1,203]<stderr>:[E ProcessGroupNCCL.cpp:566] [Rank 203] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1804386 milliseconds before timing out.
[1,252]<stderr>:[E ProcessGroupNCCL.cpp:566] [Rank 252] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1804396 milliseconds before timing out.
[1,254]<stderr>:[E ProcessGroupNCCL.cpp:566] [Rank 254] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1804410 milliseconds before timing out.
[1,255]<stderr>:[E ProcessGroupNCCL.cpp:566] [Rank 255] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1804451 milliseconds before timing out.
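
For reference, here is a minimal sketch of what I have in mind, assuming the job calls `torch.distributed.init_process_group` directly and the launcher already provides the rendezvous environment. As far as I know, that function takes a `timeout` argument as a `datetime.timedelta`, so "infinity" would have to be a very large finite value. The helper names `build_init_kwargs` and `init_with_long_timeout` and the 24-hour default are hypothetical and only for illustration. If a wrapper library creates the process group, the setting would have to go through that library instead.

```python
import datetime


def build_init_kwargs(timeout_hours: float = 24.0) -> dict:
    """Hypothetical helper: keyword arguments for init_process_group.

    The logs above show Timeout(ms)=1800000, i.e. 30 minutes. There is no
    literal "infinite" value, so a large finite timedelta is used instead.
    """
    return {
        "backend": "nccl",
        "timeout": datetime.timedelta(hours=timeout_hours),
    }


def init_with_long_timeout(timeout_hours: float = 24.0) -> None:
    """Create the default process group with the longer timeout.

    Assumes the launcher has already set up rendezvous (for example the
    RANK, WORLD_SIZE, MASTER_ADDR and MASTER_PORT environment variables).
    torch is imported lazily so build_init_kwargs works without it.
    """
    import torch.distributed as dist

    dist.init_process_group(**build_init_kwargs(timeout_hours))
```

Each rank would call `init_with_long_timeout()` once at startup, before any collective runs. A longer timeout only delays the watchdog abort; it does not explain why the ALLREDUCE never completes on these ranks.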