Hi, I recently ran into an issue with PyTorch:
RuntimeError: CUDA error: unspecified launch failure
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
After this, the GPU got lost. Typing
Unable to determine the device handle for GPU 0000:02.00.0: Unknown Error
Unfortunately, this is all the information the terminal displayed. However, after going through this discussion, I found that I can get the code to run by doing one of the following:
1. Set CUDA_LAUNCH_BLOCKING=1. Surprisingly, this did not provide any additional error info. Instead, the program ran properly with this set, at the cost of 5x-6x longer training time.
2. Reduce the number of workers to 0 and set non_blocking=False when transferring input images/labels to the GPU. This also worked, at the cost of 6x-8x longer data loading time.
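For anyone trying to reproduce this, here is a minimal sketch of what the two workarounds look like in code, assuming a standard DataLoader-based loop. The dataset, tensor shapes, and batch size are placeholders, not my actual training setup, and the sketch falls back to the CPU when no GPU is visible. Both workarounds are shown together for brevity; either one alone was enough in my case.

```python
import os

# Workaround 1: make CUDA kernel launches synchronous.
# This must be set before CUDA is initialized, so do it before importing torch.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch
from torch.utils.data import DataLoader, TensorDataset

# Fall back to CPU so the sketch still runs on a machine without a GPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Placeholder data (assumption): 8 fake RGB images with integer labels.
dataset = TensorDataset(
    torch.randn(8, 3, 32, 32),
    torch.randint(0, 10, (8,)),
)

# Workaround 2a: num_workers=0 loads batches in the main process,
# with no worker subprocesses.
loader = DataLoader(dataset, batch_size=4, num_workers=0)

batches_seen = 0
for images, labels in loader:
    # Workaround 2b: non_blocking=False requests an ordinary
    # synchronous host-to-device copy.
    images = images.to(device, non_blocking=False)
    labels = labels.to(device, non_blocking=False)
    batches_seen += 1
```

The environment variable can also be set from the shell when launching the script, which avoids relying on import order.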
Another observation is that when I increased the batch size, the errors sometimes occurred even though the program only used 2/3 of the GPU memory.
Here are my system configurations:
GPU: 2080Ti
OS: Ubuntu 18.04
Nvidia driver version: 495.29.05
CUDA version: 11.5
PyTorch version: tried 1.9 to 1.11, they all had similar issues
Please help! Thanks!