CUDA error: unspecified launch failure

Hi, my PyTorch training recently started failing with this error:

RuntimeError: CUDA error: unspecified launch failure
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.

After this, the GPU was lost. Running nvidia-smi gave:

Unable to determine the device handle for GPU 0000:02.00.0: Unknown Error

Unfortunately, this is all the information the terminal displayed. However, after going through this discussion, I found that I can get the code to run by doing either of the following:

1. Set CUDA_LAUNCH_BLOCKING=1. Surprisingly, this did not provide any additional error info. Instead, the program ran properly with this setting, at the cost of 5x-6x longer training time.
2. Reduce the number of workers to 0 and set non_blocking=False when transferring input images/labels to the GPU. This also worked, at the cost of 6x-8x longer data loading time.
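
For anyone trying workaround 1, here is a minimal sketch of one way to apply it. It assumes the training code lives in a script called train.py (a placeholder name, not from my setup) and sets the variable in the environment of the launched process, so it is in place before CUDA is initialized. The helper names are made up for illustration.

```python
import os
import subprocess
import sys


def blocking_env(base=None):
    """Return a copy of the environment with CUDA_LAUNCH_BLOCKING=1 added."""
    env = dict(os.environ if base is None else base)
    # Forces kernel launches to run synchronously, which is slow but makes
    # errors surface at the call that caused them.
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


def run_training(script="train.py"):
    """Launch the (hypothetical) training script with the variable set."""
    return subprocess.run([sys.executable, script], env=blocking_env(), check=False)
```

Calling run_training() should be equivalent to running CUDA_LAUNCH_BLOCKING=1 python train.py from the shell. Workaround 2 is a change inside the training code itself (num_workers=0 on the DataLoader and non_blocking=False on the .to(...) calls), so it is not shown here.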

Another observation is that when I increased the batch size, the errors sometimes occurred even though the program only used 2/3 of the GPU memory.

Here are my system configurations:

GPU: 2080Ti
OS: Ubuntu 18.04
Nvidia driver version: 495.29.05
CUDA version: 11.5
PyTorch version: tried 1.9 through 1.11; all had similar issues

Please help! Thanks!

Could you post a minimal, executable code snippet to reproduce the issue on the 2080Ti, please?

Hi, I simply ran the code from the PyTorch example here, without any parallelization or distributed features (as I only have 1 GPU).

Do you see any Xid errors in dmesg? I don’t know whether the error reported by nvidia-smi is the root cause or a symptom of the earlier issue.

Do you mean the output of the dmesg command? I ran dmesg | grep GPU. Here is the output when the GPU was functioning properly (nothing running):

And here is the output after the above error occurred:

Thank you! Yes, the Xid 79 is helpful: it means that the “GPU has fallen off the bus”, which can be caused by a hardware error, a driver error, system memory corruption, or a thermal issue.
A few weeks ago I saw the same error in a case where the power plug wasn’t fully connected to the GPU.
Given that, Xid 79 should not be raised by user code, so I don’t believe your PyTorch code (or any library) is the root cause.
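
If it helps with checking the logs later, here is a rough sketch of pulling Xid codes out of saved dmesg output. It assumes the driver logs lines of the form "NVRM: Xid (PCI:0000:02:00): 79, ...", which is the usual shape but may vary between driver versions. The sample line is made up for illustration and is not taken from the output posted in this thread.

```python
import re

# Assumed log shape: "NVRM: Xid (PCI:0000:02:00): 79, ..."
# The "PCI:" prefix is treated as optional.
XID_PATTERN = re.compile(r"NVRM: Xid \((?:PCI:)?([0-9a-fA-F:.]+)\): (\d+)")


def find_xid_events(dmesg_text):
    """Return a list of (pci_address, xid_code) tuples found in dmesg text."""
    events = []
    for line in dmesg_text.splitlines():
        match = XID_PATTERN.search(line)
        if match:
            events.append((match.group(1), int(match.group(2))))
    return events


# Illustrative line only, not real output from this thread.
sample = "[ 1234.567890] NVRM: Xid (PCI:0000:02:00): 79, pid=1234, GPU has fallen off the bus."
print(find_xid_events(sample))
```

Filtering on "Xid" rather than "GPU" (for example dmesg | grep -i xid) is likely to catch Xid events whose message text does not mention the word GPU.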

Thanks for your help!