Multi-GPU training timeout error: WorkNCCL(OpType=ALLGATHER) timeout

I am running the salesforce/MUST codebase (PyTorch code for MUST) from GitHub as-is, on 16 V100 GPUs.

After training, I get the following error during evaluation. It occurs with the ImageNet dataset; with a smaller dataset such as Caltech101 it does not happen. Increasing the timeout limit beyond 1800 seconds only delays the iteration at which the error occurs during evaluation.
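For reference, this is how I raised the limit: the default 30-minute collective timeout can be increased through the `timeout` argument of `init_process_group`. A minimal single-process sketch on the gloo backend (the real run uses `"nccl"` and gets rank/world size from torchrun; the address and port here are placeholders):

```python
import datetime
import os
import torch.distributed as dist

# Single-process demo settings; in the real 16-GPU run torchrun sets these.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29501")

# The default collective timeout is 30 minutes (1800 s). Passing a larger
# timedelta raises it for every collective issued through this process group.
dist.init_process_group(
    backend="gloo",          # "nccl" on the actual multi-GPU setup
    rank=0,
    world_size=1,
    timeout=datetime.timedelta(hours=2),
)
```

As noted above, though, raising the timeout only postpones the failure, which suggests a rank is genuinely stuck rather than merely slow.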

Solving this would be a great start to the new year :wink:

Test: [ 2090/10010] eta: 1:53:32 acc1: 73.3203 (76.6279) ema_acc1: 77.7734 (76.4956) time: 0.8568 data: 0.0001 max mem: 14135
[E ProcessGroupNCCL.cpp:587] [Rank 12] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1800172 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:587] [Rank 15] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1800768 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:341] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:341] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:587] [Rank 9] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1801901 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:587] [Rank 8] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1801862 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:587] [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1801679 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:587] [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1802173 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:587] [Rank 11] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1802073 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:587] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1801667 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:587] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1801777 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:587] [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1801821 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:587] [Rank 10] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1802399 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:587] [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1802025 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:587] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1802703 milliseconds before timing out.
Test: [ 2100/10010] eta: 1:53:23 acc1: 79.5312 (76.6506) ema_acc1: 81.9922 (76.5298) time: 0.8566 data: 0.0001 max mem: 14135
[E ProcessGroupNCCL.cpp:341] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:341] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:341] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:341] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:341] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:341] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1546 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1549 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1553 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1554 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1555 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1556 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1558 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1562 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1563 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1564 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1565 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1566 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1570 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1574 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 12 (pid: 1567) of binary: /nfs/users/ext_jameel.hassan/anaconda3/envs/must/bin/python
Traceback (most recent call last):
  File "/nfs/users/ext_jameel.hassan/anaconda3/envs/must/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/nfs/users/ext_jameel.hassan/anaconda3/envs/must/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/nfs/users/ext_jameel.hassan/anaconda3/envs/must/lib/python3.8/site-packages/torch/distributed/run.py", line 723, in <module>
    main()
  File "/nfs/users/ext_jameel.hassan/anaconda3/envs/must/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/nfs/users/ext_jameel.hassan/anaconda3/envs/must/lib/python3.8/site-packages/torch/distributed/run.py", line 719, in main
    run(args)
  File "/nfs/users/ext_jameel.hassan/anaconda3/envs/must/lib/python3.8/site-packages/torch/distributed/run.py", line 710, in run
    elastic_launch(
  File "/nfs/users/ext_jameel.hassan/anaconda3/envs/must/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/nfs/users/ext_jameel.hassan/anaconda3/envs/must/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 259, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

train.py FAILED

Failures:

[1]:
time : 2023-01-02_15:24:36
host : HOST
rank : 15 (local_rank: 15)
exitcode : -6 (pid: 1575)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 1575

Root Cause (first observed failure):

[0]:
time : 2023-01-02_15:24:36
host : HOST
rank : 12 (local_rank: 12)
exitcode : -6 (pid: 1567)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 1567

Found the bug: I was using the training images for validation, which caused the timeout. I believe this is because the evaluation runs on a single GPU, and once the 30-minute collective time limit is reached the watchdog kills the process.
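For anyone hitting the same symptom: the safe pattern is to shard the same (correct) validation split across all ranks with `DistributedSampler`, so every rank iterates the same number of batches and issues the same number of collectives. A minimal sketch with a made-up stand-in dataset (the tensor, sizes, and `num_replicas=16` are illustrative, not from the MUST code):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, DistributedSampler

# Hypothetical stand-in for the validation set. The key point: every rank
# must shard the SAME split, so none is left waiting in all_gather while
# another churns through a much larger (e.g. training) set by mistake.
val_data = TensorDataset(torch.arange(100).float().unsqueeze(1))

# DistributedSampler divides the validation set evenly across the 16 ranks;
# here we pass rank/num_replicas explicitly so the sketch runs standalone.
sampler = DistributedSampler(val_data, num_replicas=16, rank=0, shuffle=False)
loader = DataLoader(val_data, batch_size=8, sampler=sampler)

print(len(sampler))  # ceil(100 / 16) = 7 samples on this rank
```

If the per-rank shard sizes differ (or one rank evaluates a different dataset entirely), the faster ranks block inside the next all_gather until the 1800 s watchdog fires, which matches the log above.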
