Get RuntimeError: CUDA error: no kernel image is available for execution on the device on weight initialisation with cuda 10.2, torchvision 0.6.0 and torch 1.5.0

VFernandez · November 3, 2020, 2:12pm

I am deploying a neural network on my Ubuntu machine, and when the weights are initialised I get an error: “RuntimeError: CUDA error: no kernel image is available for execution on the device”.

In nvidia-smi, the CUDA version is 10.2.
I have 2 GPUs. One of them (K40c) is very old and requires an old version of torchvision (0.4.0), but I am not using it: I specify cuda:1 and have made sure that index 1 points to the newer GPU (Titan V).

Following https://pytorch.org/get-started/previous-versions/, I made sure the installed torch and torchvision versions were compatible with CUDA 10.2:

torch version: 1.5.0
torchvision: 0.6.0

I still get the error, specifically at this line:

    init.xavier_normal_(m.weight.data, gain=gain)
  File "…/python3.6/site-packages/torch/nn/init.py", line 282, in xavier_normal_
    return _no_grad_normal_(tensor, 0., std)
  File "…/python3.6/site-packages/torch/nn/init.py", line 19, in _no_grad_normal_
    return tensor.normal_(mean, std)
RuntimeError: CUDA error: no kernel image is available for execution on the device.

Is it possible that the old GPU, despite not being used, is still causing this problem?

Thank you.

ptrblck · November 3, 2020, 11:54pm

Could you rerun the code with

CUDA_VISIBLE_DEVICES=id python script.py args

where id should be 0 or 1, depending on which GPU is mapped to that device id, and make sure that only the TitanV is found?
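
The same restriction can be sketched from inside Python, assuming the variables are set before anything in the process initialises CUDA. The helper names `restricted_env` and `list_visible_gpus` are hypothetical. Setting `CUDA_DEVICE_ORDER=PCI_BUS_ID` is an added assumption, so that the ids follow the nvidia-smi ordering; the CUDA runtime otherwise defaults to a fastest-first ordering that can differ from it.

```python
import os


def restricted_env(gpu_id, base_env=None):
    """Return a copy of the environment that exposes a single GPU.

    Hypothetical helper: gpu_id is the index shown by nvidia-smi.
    """
    env = dict(os.environ if base_env is None else base_env)
    # Enumerate GPUs in PCI bus order, matching nvidia-smi.
    env["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
    # Hide every GPU except the requested one.
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_id)
    return env


def list_visible_gpus():
    """Return the names of the GPUs that torch can see (requires torch)."""
    import torch  # imported late so the environment is set first

    return [torch.cuda.get_device_name(i)
            for i in range(torch.cuda.device_count())]
```

Calling `os.environ.update(restricted_env(1))` at the very top of the script and then `list_visible_gpus()` should report exactly one device. Note that once only one GPU is exposed, it is addressed as cuda:0 inside the process, whatever its original index was.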

The error could also point to an NVIDIA driver that is too old.
For CUDA 10.2, you would need >=440.33, as given in this table.

VFernandez · November 11, 2020, 8:42am

Hello! Thank you for your answer.
Using CUDA_VISIBLE_DEVICES doesn’t work; the same error still appears.
The driver version is 440.100, so it should be okay according to the NVIDIA website.
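
The comparison here is between dotted version components, not decimals, which is easy to misread. A minimal sketch of the check, assuming the driver version is a dot-separated string such as the one nvidia-smi reports; the function names are hypothetical, and the 440.33 minimum is the figure quoted earlier in the thread.

```python
import subprocess

MIN_DRIVER_CUDA_10_2 = "440.33"  # minimum quoted earlier in the thread


def parse_version(text):
    """Turn '440.100' into (440, 100) so components compare as integers."""
    return tuple(int(part) for part in text.strip().split("."))


def driver_is_new_enough(installed, minimum=MIN_DRIVER_CUDA_10_2):
    """Return True if the installed driver is at least the minimum."""
    return parse_version(installed) >= parse_version(minimum)


def installed_driver_versions():
    """Ask nvidia-smi for the driver version, one entry per GPU."""
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        universal_newlines=True,
    )
    return [line.strip() for line in out.splitlines() if line.strip()]
```

With these helpers, `driver_is_new_enough("440.100")` returns True because the second component, 100, is greater than 33, even though reading the two versions as decimals would suggest the opposite.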

ptrblck · November 11, 2020, 10:35am

I don’t know why it shouldn’t work, as I’m using a TitanV myself with the binaries, different CUDA versions, and builds from source.
If CUDA_VISIBLE_DEVICES is not working, you could try to disable or remove the old GPU for the sake of debugging and rerun your script.
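
Before removing the card, it may help to confirm which index torch actually maps to which GPU, and whether the installed binary ships kernels for it, since this error generally means that no kernel compiled for the device's compute capability was found. A minimal sketch under stated assumptions: `arch_tag`, `has_native_kernels` and `report_devices` are hypothetical helpers, `torch.cuda.get_arch_list()` may not exist in older releases such as 1.5.0 (hence the `getattr` guard), and the check ignores PTX forward compatibility.

```python
def arch_tag(capability):
    """Map a (major, minor) compute capability to a tag such as 'sm_70'."""
    major, minor = capability
    return "sm_{}{}".format(major, minor)


def has_native_kernels(capability, arch_list):
    """Return True if arch_list contains the exact architecture tag.

    Simplification: PTX entries ('compute_XX') can also serve newer GPUs
    through JIT compilation, which this check does not model.
    """
    return arch_tag(capability) in arch_list


def report_devices():
    """Print name, capability and kernel availability per device (needs torch)."""
    import torch

    # Assumption: get_arch_list is only present in newer torch releases.
    get_arch_list = getattr(torch.cuda, "get_arch_list", None)
    arch_list = get_arch_list() if get_arch_list else None
    for i in range(torch.cuda.device_count()):
        cap = torch.cuda.get_device_capability(i)
        name = torch.cuda.get_device_name(i)
        if arch_list is None:
            status = "unknown (arch list not available)"
        else:
            status = has_native_kernels(cap, arch_list)
        print("cuda:{} {} {} native kernels: {}".format(
            i, name, arch_tag(cap), status))
```

The K40c has compute capability 3.5 and the Titan V 7.0, so if `report_devices()` lists the K40c as cuda:1, the ordering seen by torch differs from the one shown by nvidia-smi and the script is initialising weights on the old card.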