Hi.
I have the same error message, but for me it is transient, i.e., it happens randomly during either training or validation. I have only seen the issue with DDP (multiple GPUs); single-GPU runs without DDP work fine.
I am also using multiple DataLoader workers for both training and validation.
I am using torch==1.9.0+cu111 and pytorch-lightning==1.4.2.
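For reference, here is a minimal sketch of the kind of setup I'm running. The module, dataset, and loader below (`ToyModule`, `make_loader`) are hypothetical stand-ins, not my actual code; only the DDP / multi-worker / version details match my real runs.

```python
import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl


class ToyModule(pl.LightningModule):
    # Hypothetical stand-in for my real LightningModule.
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return F.mse_loss(self.layer(x), y)

    def validation_step(self, batch, batch_idx):
        x, y = batch
        self.log("val_loss", F.mse_loss(self.layer(x), y))

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters())


def make_loader():
    # Random tensors in place of my real dataset; multiple workers as in my runs.
    data = TensorDataset(torch.randn(1024, 32), torch.randn(1024, 1))
    return DataLoader(data, batch_size=64, num_workers=4)


if __name__ == "__main__":
    trainer = pl.Trainer(
        gpus=2,              # the error only appears with more than one GPU
        accelerator="ddp",   # DDP; single-GPU runs without DDP are fine
        max_epochs=2,
    )
    trainer.fit(ToyModule(), make_loader(), make_loader())
```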
Here is the start of the error:
```
terminate called after throwing an instance of 'c10::CUDAError'
  what():  CUDA error: initialization error
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Exception raised from insert_events at /pytorch/c10/cuda/CUDACachingAllocator.cpp:1089 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7f73757b9a22 in /usr/local/lib/python3.6/dist-packages/torch/lib/libc10.so)
...
```