I was running distributed training of my HRNet model on my 2 GPUs. After a certain number of batch iterations, training failed with the errors below. Meanwhile, when I checked with nvidia-smi from the command line, it reported: Unable to determine the device handle for GPU 0000:67:00.0: Unknown Error…
I have been saving a checkpoint every 2500 iterations, and my model is 1.67 GB.
RuntimeError: NCCL communicator was aborted on rank 0.
[E ProcessGroupNCCL.cpp:325] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
what(): [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLGATHER, Timeout(ms)=4800000) ran for 4806903 milliseconds before timing out.
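In case it helps, this is roughly how I plan to capture more detail on the next run. These environment variables come from the NCCL and PyTorch documentation; the dmesg check is a common way to look for the GPU falling off the bus, which the nvidia-smi "Unable to determine the device handle" message suggests. This is just a sketch of my debugging setup, not a fix:

```shell
# Turn on verbose NCCL logging so the failing collective is visible in the logs
export NCCL_DEBUG=INFO

# Ask PyTorch's NCCL watchdog to tear the job down promptly on a failed
# collective instead of letting other ranks hang until the timeout
export NCCL_ASYNC_ERROR_HANDLING=1

# Check the kernel log for NVIDIA Xid errors, which usually accompany a GPU
# dropping off the bus (may require root)
dmesg | grep -i xid
```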
Please, I need help with this.