Hi, my PyTorch training recently ran into this error:
RuntimeError: CUDA error: unspecified launch failure
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
After this, the GPU was lost. Running nvidia-smi gave:
Unable to determine the device handle for GPU 0000:02.00.0: Unknown Error
Unfortunately, this is all the information the terminal displayed. However, after going through this discussion, I found that I can get the code to run by doing one of the following:
1. Set CUDA_LAUNCH_BLOCKING=1. Surprisingly, this did not produce any additional error information. Instead, the program ran properly with it set, at the cost of training taking 5x-6x as long.
2. Reduce the number of workers to 0 and set non_blocking=False when transferring input images/labels to the GPU. This also worked, at the cost of data loading taking 6x-8x as long.
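For reference, here is a minimal sketch of how the two workarounds look in code. The dataset, tensor shapes, batch size, and the move_batch helper are placeholders I made up for illustration, not my real training code. Either workaround alone was enough in my case; they are shown together only for brevity. As far as I understand, CUDA_LAUNCH_BLOCKING has to be set before CUDA is initialized, so the sketch sets it before importing torch.

```python
import os

# Workaround 1: make CUDA kernel launches synchronous.
# Set before importing torch so it is in place before CUDA is initialized.
os.environ.setdefault("CUDA_LAUNCH_BLOCKING", "1")

import torch
from torch.utils.data import DataLoader, TensorDataset

# Fall back to CPU so the sketch also runs on a machine without a GPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Placeholder data standing in for the real images/labels (hypothetical shapes).
dataset = TensorDataset(
    torch.randn(8, 3, 32, 32),
    torch.randint(0, 10, (8,)),
)

# Workaround 2, part 1: num_workers=0 loads data in the main process.
loader = DataLoader(dataset, batch_size=4, num_workers=0)


def move_batch(images, labels, device):
    """Workaround 2, part 2: synchronous host-to-device copies."""
    images = images.to(device, non_blocking=False)
    labels = labels.to(device, non_blocking=False)
    return images, labels


for images, labels in loader:
    images, labels = move_batch(images, labels, device)
    # ... forward/backward pass would go here ...
```

The same environment variable can also be set from the shell when launching the script, which avoids touching the code at all.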
Another observation: when I increased the batch size, the error sometimes occurred even though the program used only 2/3 of the GPU memory.
Here is my system configuration:
GPU: 2080Ti
OS: Ubuntu 18.04
NVIDIA driver version: 495.29.05
CUDA version: 11.5
PyTorch version: tried 1.9 through 1.11; all had similar issues
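In case anyone wants to compare setups, most of these values can be read from Python with a small sketch like the one below. The collect_env function name and the dictionary keys are my own choices for illustration. Note that this reports the CUDA version PyTorch was built against, which may differ from the one installed system-wide, and it does not report the driver version (that comes from nvidia-smi).

```python
import platform

import torch


def collect_env():
    """Gather basic OS / PyTorch / GPU information into a dict."""
    info = {
        "os": platform.platform(),
        "torch": torch.__version__,
        # CUDA version PyTorch was built with; None on CPU-only builds.
        "cuda_build": torch.version.cuda,
        "gpu": None,
    }
    # Only query the device name if a GPU is actually usable.
    if torch.cuda.is_available():
        info["gpu"] = torch.cuda.get_device_name(0)
    return info


for key, value in collect_env().items():
    print(f"{key}: {value}")
```

PyTorch also ships a more complete report via `python -m torch.utils.collect_env`.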
Please help! Thanks!