CUDA error: unspecified launch failure

Hi, my PyTorch training recently ran into this issue:

RuntimeError: CUDA error: unspecified launch failure
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.

After this, the GPU was lost. Running nvidia-smi gave:

Unable to determine the device handle for GPU 0000:02:00.0: Unknown Error

Unfortunately, this is all the information the terminal displayed. However, after going through this discussion, I can conditionally get the code to run by doing one of the following:

1. Set CUDA_LAUNCH_BLOCKING=1. Surprisingly, this did not provide any additional error information. Instead, the program ran properly after setting it, at the cost of 5x-6x longer training time.
2. Reduce the number of DataLoader workers to 0 and set non_blocking=False when transferring input images/labels to the GPU (see the sketch after this list). This also worked, at the cost of 6x-8x longer data loading time.
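
For reference, here is roughly how I apply both workarounds; the dummy TensorDataset just stands in for my actual dataset:

```python
import os
# Workaround 1: force synchronous kernel launches
# (must be set before the first CUDA call).
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch
from torch.utils.data import DataLoader, TensorDataset

# Dummy data standing in for my real dataset.
train_dataset = TensorDataset(torch.randn(1024, 3, 224, 224),
                              torch.randint(0, 10, (1024,)))

loader = DataLoader(
    train_dataset,
    batch_size=64,
    shuffle=True,
    num_workers=0,  # Workaround 2a: load data in the main process
)

device = torch.device("cuda")
for images, labels in loader:
    # Workaround 2b: synchronous host-to-device copies.
    images = images.to(device, non_blocking=False)
    labels = labels.to(device, non_blocking=False)
```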

Another observation: when I increased the batch size, the error sometimes occurred even though the program was only using about 2/3 of the GPU memory.

Here are my system configurations:

GPU: 2080Ti
OS: Ubuntu 18.04
Nvidia driver version: 495.29.05
CUDA version: 11.5
PyTorch version: tried 1.9 through 1.11; all showed similar issues

Please help! Thanks!

Could you post a minimal, executable code snippet to reproduce the issue on the 2080Ti, please?

Hi, I simply ran the code from the PyTorch examples here, without any parallelization or distributed features (as I only have 1 GPU).
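
In essence it is a standard single-GPU training loop along these lines (a simplified sketch of that example, not the exact code):

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

device = torch.device("cuda")
model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10)).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = nn.CrossEntropyLoss()

dataset = datasets.MNIST(".", train=True, download=True,
                         transform=transforms.ToTensor())
loader = DataLoader(dataset, batch_size=64, shuffle=True,
                    num_workers=4, pin_memory=True)

for images, labels in loader:
    # These are the host-to-device transfers where non_blocking matters.
    images = images.to(device, non_blocking=True)
    labels = labels.to(device, non_blocking=True)

    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
```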

Do you see any Xid errors in dmesg (e.g. via dmesg | grep -i xid)? I don't know if the error reported by nvidia-smi is the root cause or a symptom of a previous issue.

Do you mean the output of the dmesg command? I ran dmesg | grep GPU; here is the output while the GPU functioned properly (nothing running):
[screenshot: dmesg | grep GPU output on a healthy system]

And here is the output after the above error occurred:
[screenshot: dmesg output showing Xid 79 errors]

Thank you! Yes, the Xid 79 is helpful, as it indicates that the "GPU has fallen off the bus", which can be caused by a hardware error, a driver error, system memory corruption, or a thermal issue.
I think a few weeks ago I saw the same failure on a system where the power plug wasn't fully connected to the GPU.
In any case, Xid 79 should not be raised by user code, so I don't believe your PyTorch code (or any library) is the root cause.
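
If you want to rule out a thermal or power issue, one option is to log the GPU's temperature and power draw while training runs. This is just a monitoring sketch using the NVML Python bindings (pip install nvidia-ml-py), not something PyTorch itself provides:

```python
# Minimal GPU health logger: run it alongside training and check
# whether temperature or power spikes right before the Xid 79 event.
import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

try:
    while True:
        temp = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)
        power_w = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0  # mW -> W
        print(f"temp={temp}C power={power_w:.1f}W", flush=True)
        time.sleep(5)
finally:
    pynvml.nvmlShutdown()
```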

Thanks for your help!