How to set NCCL timeout to infinity

I’m hitting the following issue a lot. Is there a way to change the timeout from 30 minutes to infinity so that I can inspect the details?

[1,158]<stderr>:[E ProcessGroupNCCL.cpp:566] [Rank 158] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1803931 milliseconds before timing out.
[1,105]<stderr>:[E ProcessGroupNCCL.cpp:566] [Rank 105] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1804038 milliseconds before timing out.
[1,110]<stderr>:[E ProcessGroupNCCL.cpp:566] [Rank 110] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1804067 milliseconds before timing out.
[1,207]<stderr>:[E ProcessGroupNCCL.cpp:566] [Rank 207] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1804115 milliseconds before timing out.
[1,248]<stderr>:[E ProcessGroupNCCL.cpp:566] [Rank 248] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1804127 milliseconds before timing out.
[1,202]<stderr>:[E ProcessGroupNCCL.cpp:566] [Rank 202] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1804206 milliseconds before timing out.
[1,106]<stderr>:[E ProcessGroupNCCL.cpp:566] [Rank 106] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1804245 milliseconds before timing out.
[1,104]<stderr>:[E ProcessGroupNCCL.cpp:566] [Rank 104] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1804245 milliseconds before timing out.
[1,251]<stderr>:[E ProcessGroupNCCL.cpp:566] [Rank 251] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1804274 milliseconds before timing out.
[1,205]<stderr>:[E ProcessGroupNCCL.cpp:566] [Rank 205] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1804314 milliseconds before timing out.
[1,203]<stderr>:[E ProcessGroupNCCL.cpp:566] [Rank 203] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1804386 milliseconds before timing out.
[1,252]<stderr>:[E ProcessGroupNCCL.cpp:566] [Rank 252] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1804396 milliseconds before timing out.
[1,254]<stderr>:[E ProcessGroupNCCL.cpp:566] [Rank 254] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1804410 milliseconds before timing out.
[1,255]<stderr>:[E ProcessGroupNCCL.cpp:566] [Rank 255] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1804451 milliseconds before timing out.

When initializing the process group, you can pass a very large timeout
to init_process_group to simulate an infinite timeout: pytorch/distributed_c10d.py at master · pytorch/pytorch · GitHub
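
For illustration, here is a minimal sketch of passing a large timeout. The helper name init_with_long_timeout and the 10-day value are my own assumptions, not anything from the linked source. It also assumes the launcher provides the rank, world size, and master address through the environment (the default env:// init method).

```python
import datetime
import os

# Assumption: 10 days stands in for "infinity"; use any value longer than the job.
LONG_TIMEOUT = datetime.timedelta(days=10)


def init_with_long_timeout(backend="nccl", timeout=LONG_TIMEOUT):
    """Initialize the default process group with a very large timeout (hypothetical helper)."""
    # Imported lazily so this module can be loaded on a machine without PyTorch.
    import torch.distributed as dist

    # Per the PyTorch 1.10 docs, the NCCL backend only honors the timeout when
    # NCCL_BLOCKING_WAIT or NCCL_ASYNC_ERROR_HANDLING is set to 1.
    os.environ.setdefault("NCCL_ASYNC_ERROR_HANDLING", "1")

    # Rank and world size are read from the environment set up by the launcher.
    dist.init_process_group(backend=backend, timeout=timeout)
```

If the value takes effect, the Timeout(ms) field in the watchdog message should change from 1800000 to the new value (864000000 for 10 days). If it still reads 1800000, the setting is probably not reaching the process group that runs the collective.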

That said, this error implies that some rank was stuck, crashed, or desynchronized, since a collective is generally not expected to take 30 minutes to finish. One option for debugging further is to set TORCH_DISTRIBUTED_DEBUG=DETAIL, as documented here: Distributed communication package - torch.distributed — PyTorch 1.10 documentation
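
As a rough sketch, the debug level can be set from Python before the process group is created. The helper name set_distributed_debug is hypothetical. Exporting the variable in the launch command is likely the safer route, since I am assuming here that the variable is read no earlier than process group initialization.

```python
import os

# Levels listed in the torch.distributed documentation.
VALID_LEVELS = ("OFF", "INFO", "DETAIL")


def set_distributed_debug(level="DETAIL"):
    """Set TORCH_DISTRIBUTED_DEBUG for this process (hypothetical helper)."""
    if level not in VALID_LEVELS:
        raise ValueError(f"level must be one of {VALID_LEVELS}, got {level!r}")
    # Must run before init_process_group; ideally before torch is imported at all.
    os.environ["TORCH_DISTRIBUTED_DEBUG"] = level
    return level
```

Call set_distributed_debug() at the very top of the training script on every rank, then initialize the process group as usual.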


I tried setting the timeout to 10 days and setting NCCL_ASYNC_ERROR_HANDLING to ‘1’, but it still crashes after 30 minutes. It seems like init_process_group() does not control the timeout for allreduce?