RuntimeError: CUDA error: unspecified launch failure
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
After this, the GPU was lost. Running nvidia-smi gave
Unable to determine the device handle for GPU 0000:02.00.0: Unknown Error
Unfortunately, this is all the information the terminal displayed. However, after going through this discussion, I can sometimes get the code to run by doing one of the following:
1. Set CUDA_LAUNCH_BLOCKING=1. Surprisingly, this did not provide additional error info. Instead, the program ran properly after setting it, at the cost of 5x-6x longer training time.
2. Reduce the number of DataLoader workers to 0 and set non_blocking=False when transferring input images/labels to the GPU. This also worked, at the cost of 6x-8x longer data loading time.
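For reference, the two workarounds above can be sketched as follows. The dataset and shapes here are placeholders standing in for the real training code, not taken from the original program:

```python
import os

# Workaround 1: force synchronous kernel launches.
# This must be set before the first CUDA call, so ideally before importing torch.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch
from torch.utils.data import DataLoader, TensorDataset

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Placeholder dataset standing in for the real image dataset.
dataset = TensorDataset(torch.randn(64, 3, 8, 8), torch.randint(0, 10, (64,)))

# Workaround 2: no worker processes, synchronous host-to-device copies.
loader = DataLoader(dataset, batch_size=16, num_workers=0, pin_memory=False)

for images, labels in loader:
    images = images.to(device, non_blocking=False)
    labels = labels.to(device, non_blocking=False)
    break  # one batch is enough for this sketch
```

Note that non_blocking=True only has an effect with pinned host memory anyway, so disabling pin_memory and setting non_blocking=False together makes every copy fully synchronous.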
Another observation: when I increased the batch size, the error sometimes occurred even though the program used only about 2/3 of the GPU memory.
Here are my system configurations:
GPU: 2080Ti
OS: Ubuntu 18.04
Nvidia driver version: 495.29.05
CUDA version: 11.5
PyTorch version: tried 1.9 through 1.11; all showed the same issue
Thank you! Yes, the Xid 79 is helpful, as it indicates that the "GPU has fallen off the bus", which can be caused by a hardware error, a driver error, system memory corruption, or a thermal issue.
I think a few weeks ago I saw the same symptom on a system where the power plug wasn't fully connected to the GPU.
In any case, Xid 79 should not be raised by user code, so I don't believe your PyTorch code (or any library) is the root cause.
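For anyone else debugging this: the Xid code is logged by the driver to the kernel log, so you can read it out with a generic check like this (not specific to the poster's machine):

```shell
# Look for NVIDIA Xid events in the kernel ring buffer.
# dmesg may require sudo; on systemd machines `journalctl -k` works too.
dmesg 2>/dev/null | grep -i 'NVRM: Xid' || true
# For this failure mode you would expect a line mentioning
# "79, GPU has fallen off the bus."
```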
I was experiencing this issue on my HP laptop with an Nvidia GPU while developing inside a Docker container. It turns out that if the laptop goes to sleep, the graphics driver ends up in some undesired state after waking up. Only restarting the machine worked.
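For the post-suspend variant of this problem, reloading the NVIDIA kernel modules is sometimes enough to avoid a full reboot. Whether it works depends on what is still holding the GPU, so treat this as a sketch rather than a guaranteed fix:

```shell
# Stop anything using the GPU first (containers, display manager, etc.),
# then reload the NVIDIA UVM module that CUDA depends on.
sudo rmmod nvidia_uvm
sudo modprobe nvidia_uvm
# Verify the driver responds again.
nvidia-smi
```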