Unable to troubleshoot RuntimeError: CUDA error: no kernel image is available for execution on the device

tangolin · March 23, 2022, 5:42am

Full error trace:

File "/clipbert/src/modeling/transformers.py", line 198, in forward
    embeddings = self.dropout(embeddings)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/dropout.py", line 54, in forward
    return F.dropout(input, self.p, self.training, self.inplace)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/functional.py", line 936, in dropout
    else _VF.dropout(input, p, training))
RuntimeError: CUDA error: no kernel image is available for execution on the device

After browsing related topics, I realised that the most common cause seems to be a mismatch between the PyTorch build and the GPU's compute capability. However, running

cuobjdump build/lib.linux-x86_64-3.7/torch/lib/libtorch.so | grep arch | sort | uniq

from another thread just returns me

cuobjdump info    : File '/opt/conda/lib/python3.6/site-packages/torch/lib/libtorch.so' does not contain device code

and I have no idea what to make of this error message.

Here is the info about my environment. I would appreciate it if someone could confirm whether I have a compute capability issue.

Collecting environment information...
PyTorch version: 1.5.1+cu101
Is debug build: No
CUDA used to build PyTorch: 10.1

OS: Ubuntu 18.04.3 LTS
GCC version: (Ubuntu 7.4.0-1ubuntu1~18.04.1) 7.4.0
CMake version: version 3.14.0

Python version: 3.6
Is CUDA available: Yes
CUDA runtime version: 10.1.243
GPU models and configuration: 
GPU 0: Tesla K80
GPU 1: Tesla K80
GPU 2: Tesla K80
GPU 3: Tesla K80

Nvidia driver version: 430.64
cuDNN version: /usr/lib/x86_64-linux-gnu/libcudnn.so.7.6.4

Versions of relevant libraries:
[pip] msgpack-numpy==0.4.3.2
[pip] numpy==1.17.2
[pip] torch==1.5.1+cu101
[pip] torchtext==0.4.0
[pip] torchvision==0.6.1+cu101
[conda] magma-cuda100             2.1.0                         5    local
[conda] mkl                       2019.1                      144  
[conda] mkl-include               2019.1                      144  
[conda] nomkl                     3.0                           0  
[conda] torch                     1.5.1+cu101              pypi_0    pypi
[conda] torchtext                 0.4.0                    pypi_0    pypi
[conda] torchvision               0.6.1+cu101              pypi_0    pypi

ptrblck · March 23, 2022, 6:49am

The K80 has a compute capability of 3.7 and should be supported in current binaries.
Could you update to the latest stable release and rerun your code?

You are not able to find any device code in libtorch.so because it ships in libtorch_cuda.so. You can check the supported architectures via print(torch.cuda.get_arch_list()).
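To make the check above concrete, here is a minimal sketch that compares each visible GPU's compute capability with the architectures compiled into the installed binary. The helper names parse_arch and device_is_covered are hypothetical, not part of PyTorch. The matching rule follows the general CUDA compatibility rules (binary code runs on devices with the same major version and an equal or higher minor version; PTX can be JIT-compiled for newer devices). torch.cuda.get_arch_list() may not exist in older releases, which is an assumption here, hence the hasattr guard.

```python
def parse_arch(tag):
    """Split a tag such as 'sm_37' or 'compute_70' into (kind, major, minor)."""
    kind, _, digits = tag.partition("_")
    return kind, int(digits[:-1]), int(digits[-1])


def device_is_covered(capability, arch_list):
    """Return True if some entry in arch_list can run on the given capability."""
    major, minor = capability
    for tag in arch_list:
        kind, a_major, a_minor = parse_arch(tag)
        # Binary (sm_XX) code: same major version, equal or lower minor version.
        if kind == "sm" and a_major == major and a_minor <= minor:
            return True
        # PTX (compute_XX) code: can be JIT-compiled for equal or newer devices.
        if kind == "compute" and (a_major, a_minor) <= (major, minor):
            return True
    return False


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        torch = None

    if torch is None or not torch.cuda.is_available():
        print("PyTorch with a usable CUDA device was not found")
    elif not hasattr(torch.cuda, "get_arch_list"):
        # Assumption: older releases may not provide this function.
        print("torch.cuda.get_arch_list() is not available in this release")
    else:
        archs = torch.cuda.get_arch_list()
        print("Compiled architectures:", archs)
        for idx in range(torch.cuda.device_count()):
            cap = torch.cuda.get_device_capability(idx)
            name = torch.cuda.get_device_name(idx)
            print(idx, name, cap, "covered:", device_is_covered(cap, archs))
```

If a device reports "covered: False", the binary most likely lacks a kernel image for that GPU, which is the situation this error message describes.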

tangolin · March 23, 2022, 7:47am

@ptrblck unfortunately there are a lot of other dependency issues when I use newer versions of PyTorch, so I am stuck with torch 1.5.1. Are there any other possible causes for this (such as the NVIDIA driver version)?

ptrblck · March 23, 2022, 10:04am

This CUDA error should be raised by a missing architecture in the binaries and not by the driver etc.
However, the 1.5.1 wheels for CUDA 10.2 and 10.1 contain these device codes:

arch = sm_35
arch = sm_37
arch = sm_50
arch = sm_60
arch = sm_61
arch = sm_70
arch = sm_75
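As a rough sketch, output in this "arch = sm_XX" form (for example from cuobjdump run against libtorch_cuda.so) can be parsed and checked for a given compute capability. The sample text is copied from the list above; parse_arch_lines and capability_tag are hypothetical helper names.

```python
import re

# Sample copied from the list of device codes quoted in this thread.
CUOBJDUMP_SAMPLE = """\
arch = sm_35
arch = sm_37
arch = sm_50
arch = sm_60
arch = sm_61
arch = sm_70
arch = sm_75
"""


def parse_arch_lines(text):
    """Collect the unique sm_XX tags from 'arch = sm_XX' lines, sorted by number."""
    tags = set(re.findall(r"^\s*arch\s*=\s*(sm_\d+)\s*$", text, flags=re.MULTILINE))
    return sorted(tags, key=lambda tag: int(tag.split("_")[1]))


def capability_tag(major, minor):
    """Turn a compute capability such as (3, 7) into the tag 'sm_37'."""
    return "sm_{}{}".format(major, minor)


if __name__ == "__main__":
    archs = parse_arch_lines(CUOBJDUMP_SAMPLE)
    k80 = capability_tag(3, 7)  # the K80's compute capability is 3.7
    print("Architectures found:", archs)
    print(k80, "listed:", k80 in archs)
```

An exact tag match is a simplification: it only confirms that the wheel lists the architecture, not that the binaries actually installed in the environment are the same ones.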