CUDA kernel failed : no kernel image is available for execution on the device, Error when running PyTorch model inside Google Compute VM

I have a Docker image of a PyTorch model that returns this error when run inside a GCE VM (Debian, Tesla P4 GPU, Google deep learning image):

CUDA kernel failed : no kernel image is available for execution on the device

This occurs on the line where my model is called. The PyTorch model includes custom C++ extensions; I’m using this model: https://github.com/daveredrum/Pointnet2.ScanNet

My image installs these at runtime

The image runs fine on my local system. Both the VM and my local machine have these versions:

CUDA compilation tools 10.1, V10.1.243

torch 1.4.0

torchvision 0.5.0

The main difference is the GPU, as far as I’m aware.

Local:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 435.21       Driver Version: 435.21       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 960M    Off  | 00000000:01:00.0 Off |                  N/A |
| N/A   36C    P8    N/A /  N/A |    361MiB /  2004MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

VM:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.87.01    Driver Version: 418.87.01    CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla P4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   42C    P0    23W /  75W |      0MiB /  7611MiB |      3%      Default |
+-------------------------------+----------------------+----------------------+

If I ssh into the VM, torch.cuda.is_available() returns True.

Therefore I suspect it must be something to do with the compilation of the extensions.
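
For reference, a minimal sanity check of what the runtime sees inside the container on the VM (it doesn’t exercise the extension itself; the comments show the values I’d expect for this setup):

import torch

print(torch.__version__)                    # 1.4.0
print(torch.version.cuda)                   # CUDA version PyTorch was built against, e.g. 10.1
print(torch.cuda.is_available())            # True
print(torch.cuda.get_device_name(0))        # e.g. "Tesla P4"
print(torch.cuda.get_device_capability(0))  # (6, 1) for a Tesla P4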

This is the relevant part of my docker file:

ENV CUDA_HOME "/usr/local/cuda-10.1"
ENV PATH /usr/local/nvidia/bin:/usr/local/cuda-10.1/bin:${PATH}
ENV NVIDIA_VISIBLE_DEVICES all
ENV NVIDIA_DRIVER_CAPABILITIES compute,utility
ENV NVIDIA_REQUIRE_CUDA "cuda>=10.1 brand=tesla,driver>=384,driver<385 brand=tesla,driver>=396,driver<397 brand=tesla,driver>=410,driver<411 brand=tesla,driver>=418,driver<419"
ENV FORCE_CUDA=1

# CUDA 10.1-specific steps
RUN conda install -c open3d-admin open3d
RUN conda install -y -c pytorch \
    cudatoolkit=10.1 \
    "pytorch=1.4.0=py3.6_cuda10.1.243_cudnn7.6.3_0" \
    "torchvision=0.5.0=py36_cu101" \
 && conda clean -ya
RUN pip install -r requirements.txt
RUN pip install flask
RUN pip install plyfile
RUN pip install scipy


# Install OpenCV3 Python bindings
RUN sudo apt-get update && sudo apt-get install -y --no-install-recommends \
    libgtk2.0-0 \
    libcanberra-gtk-module \
    libgl1-mesa-glx \
 && sudo rm -rf /var/lib/apt/lists/*

RUN dir
RUN cd pointnet2 && python setup.py install
RUN cd ..

I have already tried re-running this line over ssh in the VM:

TORCH_CUDA_ARCH_LIST="6.0 6.1 7.0" python setup.py install

Which I think targets the build at the Tesla P4’s compute capability (6.1)?
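
To check whether the rebuilt library actually contains code for the P4 (sm_61), I believe the built .so can be inspected with cuobjdump from the CUDA toolkit. A rough sketch (the path assumes setuptools’ default build directory for Python 3.6 on x86_64):

import glob
import subprocess

# Hypothetical check: list the GPU architectures embedded in the compiled extension.
so_files = glob.glob('build/lib.linux-x86_64-3.6/pointnet2_cuda*.so')
if so_files:
    # Look for entries mentioning sm_61 in the output.
    subprocess.run(['cuobjdump', '--list-elf', so_files[0]])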

Is there some other setting or troubleshooting step I can try?

I didn’t know anything about Docker/VMs/PyTorch extensions until a couple of days ago, so I’m somewhat shooting in the dark. I already posted this on Stack Overflow, but figured this might be a better venue!

The install instructions look correct.
Could you post the output of the installation here, please?

$ dir

__pycache__  dist                pointnet2_modules.py  pointnet2_utils.py  setup.py
build        pointnet2.egg-info  pointnet2_semseg.py   pytorch_utils.py    src

$ TORCH_CUDA_ARCH_LIST="6.0 6.1 7.0" python setup.py install

running install
running bdist_egg
running egg_info
writing pointnet2.egg-info/PKG-INFO
writing dependency_links to pointnet2.egg-info/dependency_links.txt
writing top-level names to pointnet2.egg-info/top_level.txt
reading manifest file 'pointnet2.egg-info/SOURCES.txt'
writing manifest file 'pointnet2.egg-info/SOURCES.txt'
installing library code to build/bdist.linux-x86_64/egg
running install_lib
running build_ext
creating build/bdist.linux-x86_64/egg
copying build/lib.linux-x86_64-3.6/pointnet2_cuda.cpython-36m-x86_64-linux-gnu.so -> build/bdist.linux-x86_64/egg
creating stub loader for pointnet2_cuda.cpython-36m-x86_64-linux-gnu.so
byte-compiling build/bdist.linux-x86_64/egg/pointnet2_cuda.py to pointnet2_cuda.cpython-36.pyc
creating build/bdist.linux-x86_64/egg/EGG-INFO
copying pointnet2.egg-info/PKG-INFO -> build/bdist.linux-x86_64/egg/EGG-INFO
copying pointnet2.egg-info/SOURCES.txt -> build/bdist.linux-x86_64/egg/EGG-INFO
copying pointnet2.egg-info/dependency_links.txt -> build/bdist.linux-x86_64/egg/EGG-INFO
copying pointnet2.egg-info/top_level.txt -> build/bdist.linux-x86_64/egg/EGG-INFO
writing build/bdist.linux-x86_64/egg/EGG-INFO/native_libs.txt
zip_safe flag not set; analyzing archive contents...
__pycache__.pointnet2_cuda.cpython-36: module references __file__
creating 'dist/pointnet2-0.0.0-py3.6-linux-x86_64.egg' and adding 'build/bdist.linux-x86_64/egg' to it
removing 'build/bdist.linux-x86_64/egg' (and everything under it)
Processing pointnet2-0.0.0-py3.6-linux-x86_64.egg
removing '/home/user/miniconda/envs/py36/lib/python3.6/site-packages/pointnet2-0.0.0-py3.6-linux-x86_64.egg' (and everything under it)
creating /home/user/miniconda/envs/py36/lib/python3.6/site-packages/pointnet2-0.0.0-py3.6-linux-x86_64.egg
Extracting pointnet2-0.0.0-py3.6-linux-x86_64.egg to /home/user/miniconda/envs/py36/lib/python3.6/site-packages
pointnet2 0.0.0 is already the active version in easy-install.pth
Installed /home/user/miniconda/envs/py36/lib/python3.6/site-packages/pointnet2-0.0.0-py3.6-linux-x86_64.egg
Processing dependencies for pointnet2==0.0.0
Finished processing dependencies for pointnet2==0.0.0

Thanks for the log.
Could you try adding your compute capability here, like this:

'nvcc': ['-O2',
         '-gencode', 'arch=compute_61,code=sm_61',
         ...
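
In the Pointnet2 setup.py this goes into the extra_compile_args of the CUDAExtension call. A rough sketch of how the whole call could look (the module name matches your build log, but the source list is an assumption here: I’m just globbing the src folder, so use the repo’s actual file list):

import glob
from setuptools import setup
from torch.utils.cpp_extension import BuildExtension, CUDAExtension

setup(
    name='pointnet2',
    ext_modules=[
        CUDAExtension(
            name='pointnet2_cuda',
            # Assumed layout: all C++/CUDA sources live under src/.
            sources=glob.glob('src/*.cpp') + glob.glob('src/*.cu'),
            extra_compile_args={
                'cxx': ['-g'],
                'nvcc': [
                    '-O2',
                    # Compute capability 6.1 covers the Tesla P4.
                    '-gencode', 'arch=compute_61,code=sm_61',
                ],
            },
        ),
    ],
    cmdclass={'build_ext': BuildExtension},
)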

Thanks a lot for your help

I tried it; unfortunately I’m still getting the same error. I ssh’d into the Docker container in the VM, added that line to setup.py, and then ran:

python setup.py install

I also tried changing that line and then running

TORCH_CUDA_ARCH_LIST="6.0 6.1 7.0" python setup.py install

but I still get the same error.

Will running setup.py install overwrite the extensions installed during the initial Docker image build, or do I need to rebuild the image from scratch with the modified code?

I would uninstall all previous installations to make sure you are really using the new build.
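
After reinstalling, a quick way to confirm which copy of the extension is actually imported (the module name is taken from your build log):

import pointnet2_cuda
# The printed path shows whether the freshly built egg or a stale install is being loaded.
print(pointnet2_cuda.__file__)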

Thanks a lot for your help, got it working!

I deleted everything in the setup.py folder except the src folder with the C++ code, and then rebuilt using

TORCH_CUDA_ARCH_LIST="6.0 6.1 7.0" python setup.py install

and that worked; I didn’t need to edit the setup.py file in the end.