How to resolve NCCL timeout errors while waiting for client requests?

I have a large model that is loaded with model parallelism via torch.distributed. Now I need to provide a demo for it. To avoid the time cost of loading the model on every request, I load the model once at demo startup and have the processes wait for a request to trigger inference. Unfortunately, the demo doesn't stay up for long: if no request arrives for more than 30 minutes, it crashes with an NCCL timeout error. How do I resolve this?
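For context, here is a simplified sketch of the pattern I'm describing (load_model_with_model_parallelism, wait_for_client_request, and run_inference are placeholders for my real code):

import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")
rank = dist.get_rank()
torch.cuda.set_device(rank)
model = load_model_with_model_parallelism()  # expensive, so done once at startup

while True:
    inputs = torch.empty(1024, device="cuda")
    if rank == 0:
        inputs = wait_for_client_request()  # may block for hours
    # The non-zero ranks reach this broadcast immediately and then sit in it
    # until rank 0 gets a request; if that takes longer than the NCCL timeout,
    # the watchdog aborts the communicator.
    dist.broadcast(inputs, src=0)
    run_inference(model, inputs)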

The following is the error message:

RuntimeError: NCCL communicator was aborted on rank 2.  Original reason for failure was: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805882 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:587] [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(OpType=BROADCAST, Timeout(ms)=1800000) ran for 1806339 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:587] [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(OpType=BROADCAST, Timeout(ms)=1800000) ran for 1806453 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:587] [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(OpType=BROADCAST, Timeout(ms)=1800000) ran for 1806606 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:587] [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(OpType=BROADCAST, Timeout(ms)=1800000) ran for 1806721 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:587] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(OpType=BROADCAST, Timeout(ms)=1800000) ran for 1806808 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:587] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(OpType=BROADCAST, Timeout(ms)=1800000) ran for 1806831 milliseconds before timing out.
RuntimeError: NCCL communicator was aborted on rank 1.  Original reason for failure was: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(OpType=BROADCAST, Timeout(ms)=1800000) ran for 1806831 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:341] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:341] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
terminate called after throwing an instance of 'std::runtime_error'
  what():  [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(OpType=BROADCAST, Timeout(ms)=1800000) ran for 1806339 milliseconds before timing out.
  what():  [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(OpType=BROADCAST, Timeout(ms)=1800000) ran for 1806453 milliseconds before timing out.
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 142625 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 142629 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 142630 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 142633 closing signal SIGTERM

Looks like this is the issue:
RuntimeError: NCCL communicator was aborted on rank 1. Original reason for failure was: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(OpType=BROADCAST, Timeout(ms)=1800000) ran for 1806831 milliseconds before timing out.

The 1800000 ms in the message is exactly 30 minutes, the default timeout of torch.distributed.init_process_group, which matches what you observed.

You can try the steps from the Distributed communication package - torch.distributed — PyTorch master documentation to troubleshoot your issue and understand why there's a collective that isn't finishing on rank 1.
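For example, these environment variables make the next failure easier to diagnose (a minimal sketch; they must be set before the process group is created, and depending on your PyTorch version some may already be the default):

import os

# Set these before torch.distributed.init_process_group is called,
# e.g. at the top of the launch script or in the shell environment.
os.environ["NCCL_DEBUG"] = "INFO"                 # verbose NCCL-level logging
os.environ["TORCH_DISTRIBUTED_DEBUG"] = "DETAIL"  # extra consistency checks on collectives
os.environ["NCCL_ASYNC_ERROR_HANDLING"] = "1"     # abort with an error instead of hanging

import torch.distributed as dist
dist.init_process_group(backend="nccl")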

Maybe I didn't describe it clearly enough. I already know what causes the problem: the processes block in a collective while waiting for a request from the client, and since this is a demo, it may sit idle for a long time (more than 30 minutes, possibly much longer).

One option is to increase the NCCL timeout, but I can't estimate the maximum time between client requests, and I haven't found a way to set the NCCL timeout to infinity.
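For reference, this is how I would raise the timeout; as far as I can tell the value must be finite, and whether the NCCL watchdog honors it can depend on the PyTorch version and on NCCL_ASYNC_ERROR_HANDLING / NCCL_BLOCKING_WAIT:

from datetime import timedelta
import torch.distributed as dist

# Raise the per-collective timeout from the default 30 minutes to 3 days.
dist.init_process_group(
    backend="nccl",
    timeout=timedelta(days=3),
)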

So is setting a timeout of several days the only way, or is there a better solution?
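One alternative I've been considering instead of a huge timeout (a rough sketch; poll_for_request and handle_request are placeholders I made up): have rank 0 broadcast a small control tensor at a fixed interval even when there is no work, so that each individual broadcast completes well within the 30-minute window:

import torch
import torch.distributed as dist

HEARTBEAT_SECONDS = 60  # far below the 30-minute watchdog timeout

def serve(rank, model):
    flag = torch.zeros(1, device="cuda")
    while True:
        if rank == 0:
            # Placeholder: returns a request, or None after waiting
            # at most HEARTBEAT_SECONDS.
            request = poll_for_request(timeout=HEARTBEAT_SECONDS)
            flag[0] = 1.0 if request is not None else 0.0
        # Every rank joins this broadcast at least once per heartbeat,
        # so no collective is ever outstanding for anywhere near 30 minutes.
        dist.broadcast(flag, src=0)
        if flag.item() == 1.0:
            handle_request(rank, model)  # placeholder: broadcast inputs, run inference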

I ran into this timeout issue too. Did you solve it, or did you find a better solution?