Hi.
I have the same error message, but for me it is transient, i.e., it happens randomly during either training or validation. I have only seen the issue with DDP (multiple GPUs); single-GPU runs without DDP work fine.
I am also using multiple DataLoader workers for both training and validation.
I am using torch==1.9.0+cu111 and pytorch-lightning==1.4.2.
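For reference, here is a minimal sketch of the kind of setup I'm running. The module, dataset, and loader below (`ToyModule`, `make_loader`) are hypothetical stand-ins, not my actual code; only the DDP / multi-worker / version details match my real runs.

```python
import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl


class ToyModule(pl.LightningModule):
    # Hypothetical stand-in for my real LightningModule.
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return F.mse_loss(self.layer(x), y)

    def validation_step(self, batch, batch_idx):
        x, y = batch
        self.log("val_loss", F.mse_loss(self.layer(x), y))

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters())


def make_loader():
    # Random tensors in place of my real dataset; multiple workers as in my runs.
    data = TensorDataset(torch.randn(1024, 32), torch.randn(1024, 1))
    return DataLoader(data, batch_size=64, num_workers=4)


if __name__ == "__main__":
    trainer = pl.Trainer(
        gpus=2,              # the error only appears with more than one GPU
        accelerator="ddp",   # DDP; single-GPU runs without DDP are fine
        max_epochs=2,
    )
    trainer.fit(ToyModule(), make_loader(), make_loader())
```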
Here is the start of the error:
```
terminate called after throwing an instance of 'c10::CUDAError'
  what():  CUDA error: initialization error
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Exception raised from insert_events at /pytorch/c10/cuda/CUDACachingAllocator.cpp:1089 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7f73757b9a22 in /usr/local/lib/python3.6/dist-packages/torch/lib/libc10.so)
...
```