[Urgent Help required] Installed self-compiled PyTorch CUDA-enabled Pointnet2 Extension Module has conflicting CUDA library (.so file) version requirements when imported in Python 3.6

Nich_010 · December 15, 2020, 9:00pm

Hello, first time posting here, so apologies in advance if this has already been asked or if I’ve made any mistakes.

I’m currently trying to compile a pointnet2 PyTorch implementation as a function library/module from this repo which borrowed the idea from this previous repo.

For reference, I’m currently trying to do this within a virtual environment that uses Python 3.6.9, with PyTorch version 1.7.1+cu110, CUDA version 11.1 (as reported by nvidia-smi), and NVCC version 10.1. The machine is running Pop_OS 18.04 LTS, which is a derivative of Ubuntu 18.04 LTS.

NVCC --version output:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Fri_Feb__8_19:08:17_PST_2019
Cuda compilation tools, release 10.1, V10.1.105

nvidia-smi output:

Wed Dec 16 04:53:03 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 455.28       Driver Version: 455.28       CUDA Version: 11.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce GTX 108...  Off  | 00000000:01:00.0  On |                  N/A |
| 48%   30C    P8    18W / 250W |    837MiB / 11177MiB |      6%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1343      G   /usr/bin/gnome-shell               92MiB |
|    0   N/A  N/A      1771      G   /usr/lib/xorg/Xorg                378MiB |
|    0   N/A  N/A      1862      G   /usr/bin/gnome-shell              215MiB |
|    0   N/A  N/A      2227      G   ...oken=14138817340175859824       12MiB |
|    0   N/A  N/A      2465      G   ...AAAAAAAAA= --shared-files       83MiB |
|    0   N/A  N/A      5915      G   ...AAAAAAAAA= --shared-files       11MiB |
|    0   N/A  N/A     11099      G   ...AAAAAAAAA= --shared-files       24MiB |
+-----------------------------------------------------------------------------+

Following the repo instructions, I’ve managed to actually compile the library and have it installed into my virtual environment, as can be seen below:

(test_env) nick_pc@pop-os:~/3DSSD-pytorch/lib/pointnet2$ python setup.py install
running install
running bdist_egg
running egg_info
writing pointnet2.egg-info/PKG-INFO
writing dependency_links to pointnet2.egg-info/dependency_links.txt
writing top-level names to pointnet2.egg-info/top_level.txt
/home/nick_pc/.virtualenvs/qe_env/lib/python3.6/site-packages/torch/utils/cpp_extension.py:352: UserWarning: Attempted to use ninja as the BuildExtension backend but we could not find ninja.. Falling back to using the slow distutils backend.
  warnings.warn(msg.format('we could not find ninja.'))
reading manifest file 'pointnet2.egg-info/SOURCES.txt'
writing manifest file 'pointnet2.egg-info/SOURCES.txt'
installing library code to build/bdist.linux-x86_64/egg
running install_lib
running build_ext
creating build/bdist.linux-x86_64/egg
copying build/lib.linux-x86_64-3.6/pointnet2_cuda.cpython-36m-x86_64-linux-gnu.so -> build/bdist.linux-x86_64/egg
creating stub loader for pointnet2_cuda.cpython-36m-x86_64-linux-gnu.so
byte-compiling build/bdist.linux-x86_64/egg/pointnet2_cuda.py to pointnet2_cuda.cpython-36.pyc
creating build/bdist.linux-x86_64/egg/EGG-INFO
copying pointnet2.egg-info/PKG-INFO -> build/bdist.linux-x86_64/egg/EGG-INFO
copying pointnet2.egg-info/SOURCES.txt -> build/bdist.linux-x86_64/egg/EGG-INFO
copying pointnet2.egg-info/dependency_links.txt -> build/bdist.linux-x86_64/egg/EGG-INFO
copying pointnet2.egg-info/top_level.txt -> build/bdist.linux-x86_64/egg/EGG-INFO
writing build/bdist.linux-x86_64/egg/EGG-INFO/native_libs.txt
zip_safe flag not set; analyzing archive contents...
__pycache__.pointnet2_cuda.cpython-36: module references __file__
creating 'dist/pointnet2-0.0.0-py3.6-linux-x86_64.egg' and adding 'build/bdist.linux-x86_64/egg' to it
removing 'build/bdist.linux-x86_64/egg' (and everything under it)
Processing pointnet2-0.0.0-py3.6-linux-x86_64.egg
creating /home/nick_pc/.virtualenvs/qe_env/lib/python3.6/site-packages/pointnet2-0.0.0-py3.6-linux-x86_64.egg
Extracting pointnet2-0.0.0-py3.6-linux-x86_64.egg to /home/nick_pc/.virtualenvs/qe_env/lib/python3.6/site-packages
Adding pointnet2 0.0.0 to easy-install.pth file

Installed /home/nick_pc/.virtualenvs/qe_env/lib/python3.6/site-packages/pointnet2-0.0.0-py3.6-linux-x86_64.egg
Processing dependencies for pointnet2==0.0.0
Finished processing dependencies for pointnet2==0.0.0

However, whenever I try to actually import the compiled library, I instead get the following error stating that there isn’t a libcudart.so.9.0, indicating that the library expects CUDA version 9.0 instead.

>>> import pointnet2_cuda as pointnet2
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: libcudart.so.9.0: cannot open shared object file: No such file or directory

My current impression is that the CUDA code defining the pointnet2 library was somehow written to use CUDA 9.0, but I’m very new to CUDA programming, so I may well be wrong about this. So far, no other PyTorch code on this machine has run into a similar issue, as long as it doesn’t rely on a much older version of PyTorch.

I don’t want to downgrade the CUDA version, so what I’m wondering is whether there is a way to force the library to be compiled using the CUDA version that I have so that it could be used within my Virtual Environment?

Looking forward to any suggestions/help, and thank you in advance for any replies!

EDIT: Fixed structure of post
EDIT2: Changed the tag to be more appropriate
EDIT3: Changed some wording to clarify the intention of the PyTorch Extension

ptrblck · December 16, 2020, 7:58am

I’m a bit confused about this version mismatch. How are you using the CUDA toolkit 11.1 with nvcc 10.1?

Anyway, by searching in the repo for cuda, it seems that numba is required and uses old CUDA versions as given here?

Your installation also doesn’t show the CUDA compiler at all, so it seems you’ve just installed the “Python-version”?
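One way to check the point about the compiler not showing up in the install output is to scan the build log for nvcc invocations. The sketch below is only a heuristic: the helper name `log_mentions_nvcc` is made up for illustration, and it assumes that a build which compiles `.cu` files prints the nvcc command line, which may not hold for every build backend.

```python
import re


def log_mentions_nvcc(log_text: str) -> bool:
    """Return True if any line of the build log looks like an nvcc invocation."""
    # Match "nvcc" as a bare command or at the end of a path such as .../bin/nvcc
    pattern = re.compile(r"(^|[\s/])nvcc(\s|$)")
    return any(pattern.search(line) for line in log_text.splitlines())


# Excerpt from the install output posted above: no compiler call appears in it.
posted_log = """running build_ext
creating build/bdist.linux-x86_64/egg
copying build/lib.linux-x86_64-3.6/pointnet2_cuda.cpython-36m-x86_64-linux-gnu.so -> build/bdist.linux-x86_64/egg
"""

print(log_mentions_nvcc(posted_log))
```

Saving the full output of `python setup.py install` to a file and passing its contents to the helper gives a quick yes/no answer before digging further.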

Nich_010 · December 16, 2020, 8:24am

Hi, and thank you for the reply! Sorry my response is a bit late; I needed to check some of your questions on my machine.

Now that you mention it, I’ve just realized that NVCC seems to come from my /usr/lib/cuda-10.1 directory, which seems to cause this mismatch. I believe this is because I installed System76’s version of the CUDA Toolkit, which currently still seems to be stuck at NVCC version 10.1. I’ll try to install a compatible version of nvcc myself and see how it goes.
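To see which nvcc a build would likely pick up, it can help to print the first nvcc on PATH, its release, and the `CUDA_HOME` environment variable. This is a minimal sketch using only the standard library; the function names are illustrative, and it assumes `nvcc --version` prints a `release X.Y` line as in the output posted earlier in the thread.

```python
import os
import re
import shutil
import subprocess
from typing import Optional, Tuple


def parse_nvcc_release(version_output: str) -> Optional[Tuple[int, int]]:
    """Extract (major, minor) from the text printed by `nvcc --version`."""
    match = re.search(r"release (\d+)\.(\d+)", version_output)
    return (int(match.group(1)), int(match.group(2))) if match else None


def describe_nvcc() -> dict:
    """Report the first nvcc on PATH, its release, and CUDA_HOME if it is set."""
    nvcc_path = shutil.which("nvcc")
    release = None
    if nvcc_path is not None:
        result = subprocess.run(
            [nvcc_path, "--version"],
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            universal_newlines=True,  # Python 3.6 compatible way to get text output
        )
        release = parse_nvcc_release(result.stdout)
    return {
        "nvcc": nvcc_path,
        "release": release,
        "CUDA_HOME": os.environ.get("CUDA_HOME"),
    }


print(describe_nvcc())
```

Note that the CUDA version printed by nvidia-smi is the highest version the installed driver supports, not the version of an installed toolkit, so it can legitimately differ from the nvcc release.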

With regards to the CUDA Compiler, I was unaware that there is a “Python-version” of the CUDA compiler. May I know how to check whether the CUDA compiler is installed properly in this case then, and what’s the difference between these different versions?

ptrblck · December 16, 2020, 8:26am

There is no Python version of the CUDA compiler. I meant to say that you didn’t use the CUDA compiler while building the 3rd party library, so I don’t think you’ve compiled any kernels; you might be using a “Python only” version of the pointnet2 library.

Nich_010 · December 16, 2020, 9:56am

Sorry for the late reply. Oh alright noted on that. I’m in the process of re-installing a hopefully compatible version of NVCC (my CUDA version is 11.1, but the one available from NVIDIA’s download page indicates version 11.2) and will update my progress here. Hopefully it is able to finally fully compile now.

Nich_010 · December 16, 2020, 12:23pm

Hi, I’ve already tried fixing my NVCC installation to match with the CUDA version reported by nvidia-smi, following the instructions provided by NVIDIA here, as well as here. It seems to have worked well, with nvcc --version giving me the following output:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Mon_Oct_12_20:09:46_PDT_2020
Cuda compilation tools, release 11.1, V11.1.105
Build cuda_11.1.TC455_06.29190527_0

Unfortunately, installing the pointnet2 library still yields the same ImportError: libcudart.so.9.0: cannot open shared object file: No such file or directory error whenever I try to import the pointnet2 module.

Since you mentioned that this indicates the CUDA compiler is not actually being used, and that I’m only building a “Python-only” version of the library, may I ask how to actually invoke the compiler so that the library’s kernels get built? It’s quite frustrating not to be able to use it just because the library keeps asking for an older version of the CUDA library, which I don’t want to downgrade to.
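One possible explanation, offered only as a guess from the posted log, is that `build_ext` reused an already-built `.so` under `build/` instead of compiling anything: the log goes from `running build_ext` straight to copying the existing `pointnet2_cuda` library out of `build/lib.linux-x86_64-3.6`. If that is what happened, removing the stale build output before reinstalling should force a fresh build. The helper below is a hypothetical sketch, not part of the repo.

```python
import shutil
from pathlib import Path


def remove_build_artifacts(project_dir: str) -> list:
    """Delete build/, dist/ and *.egg-info under project_dir; return what was removed."""
    root = Path(project_dir)
    targets = [root / "build", root / "dist", *root.glob("*.egg-info")]
    removed = []
    for target in targets:
        if target.is_dir():
            shutil.rmtree(target)
            removed.append(str(target))
    return removed


# Intended use (run from the directory that contains setup.py), then reinstall:
#   remove_build_artifacts(".")
#   python setup.py install
```

This only clears locally generated build output. It does not touch any prebuilt `.so` files that a repository may ship alongside its sources.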

Alexey_Demyanchuk · December 16, 2020, 1:26pm

This error is always frustrating. Sometimes you can symlink the libcudart.so.X file you have on your system to the libcudart.so.9.0 required by the library you are installing or using. I don’t think it is a good workaround, but sometimes it works well.
You can find the file on a Linux system with a command like this: find / -name "libcudart.so.*"
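The same search can be done from Python when a narrower set of directories is enough. This is a rough equivalent of the `find` command above, written as a sketch: the function name is made up, `/usr/lib/cuda-10.1` is the directory mentioned earlier in the thread, and `/usr/local` is an assumption about where a toolkit might be installed. The pattern `libcudart.so*` also matches the unversioned symlink.

```python
import fnmatch
import os


def find_libcudart(roots):
    """Walk the given directories and collect files named libcudart.so*."""
    hits = []
    for root in roots:
        # os.walk silently skips directories that are missing or unreadable
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in fnmatch.filter(filenames, "libcudart.so*"):
                hits.append(os.path.join(dirpath, name))
    return sorted(hits)


# Search roots are assumptions; adjust them to the machine being inspected.
for path in find_libcudart(["/usr/lib/cuda-10.1", "/usr/local"]):
    print(path)
```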

ptrblck · December 17, 2020, 3:45am

My assumption is based on the provided install log, as it doesn’t show any usage of nvcc.
The repository is responsible to provide install instructions to build all CUDA extensions (if available) and I’m unfortunately not familiar with this particular repository.
By just checking some folders, it seems that at least in this setup.py some .cu files should be compiled, which is not done in your install log. Also, some .so files are shipped, and I don’t know whether these libs are trying to link against libcudart.so.9.0 (you should check it).
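Checking what a given `.so` links against can be done by reading its dynamic section. The sketch below assumes the `readelf` tool from binutils is installed and that its `-d` output lists dependencies as `(NEEDED) Shared library: [name]` lines; the function names are illustrative.

```python
import re
import subprocess


def parse_needed(readelf_output: str) -> list:
    """Extract library names from the (NEEDED) entries of `readelf -d` output."""
    return re.findall(r"\(NEEDED\)\s+Shared library: \[([^\]]+)\]", readelf_output)


def needed_libraries(so_path: str) -> list:
    """Run readelf on a shared object and list the libraries it requires at load time."""
    result = subprocess.run(
        ["readelf", "-d", so_path],
        stdout=subprocess.PIPE,
        universal_newlines=True,  # Python 3.6 compatible way to get text output
        check=True,
    )
    return parse_needed(result.stdout)


# Intended use, with the path taken from the install log earlier in the thread:
#   needed_libraries("build/lib.linux-x86_64-3.6/pointnet2_cuda.cpython-36m-x86_64-linux-gnu.so")
```

If `libcudart.so.9.0` appears in the result, that particular file was built against CUDA 9.0, regardless of which toolkit is currently installed.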

Nich_010 · December 18, 2020, 9:38am

Hi, sorry for the very late reply as I was trying to resolve the issue on my end.

I believe I’ve sort of resolved the issue. For whatever reason, NVCC didn’t trigger for the 3DSSD repo that I was focusing on, but it did trigger when I tried to install the module from the original pointnet2 repo that it refers to. If I had to guess, it’s likely caused by either:

The pointnet2 library within the 3DSSD repo had perhaps missing bindings (?), or
The library in the 3DSSD repo was structured such that the C++ library bindings weren’t detected during the setup.py install for the pointnet2 library.

Unfortunately, because the code was originally written for a much older version of PyTorch (v1.0, vs. the v1.7.1 that I’m using), I still wasn’t able to build the module from the pointnet2 repo.

However, I luckily checked back and found that the pointnet2 repo itself references an even earlier implementation of the Pointnet2 kernel by Erik Wijmans, which has been updated to support later versions of PyTorch. That repo seems to have its structure and functions properly updated for later versions of PyTorch, so installation went smoothly for me.

The pointnet2 repo by sshaoshuai is noted to include some kernel-level changes that are claimed to improve the speed of the pointnet2 functions, though, so I decided to try combining the features from that repo into Erik Wijmans’s implementation.

Fortunately this approach seemed to work for me, and I’m now able to run the training on the 3DSSD repo, which ultimately was my goal. Given this experience, I’ll perhaps try to update sshaoshuai’s implementation so that it’s compatible with the latest versions of PyTorch (similar to how Erik Wijmans’s implementation works fine for me), once I have the time.

Thank you once again for the help and suggestions I’ve received so far (in particular regarding NVCC and how the CUDA compiler works). I’ll post an update if I encounter any other problems with this issue in the future.

ptrblck · December 18, 2020, 7:56pm

That sounds great and I’m glad the repo is now working for you!

ionut · December 12, 2023, 3:13pm

Hi! I will jump into this discussion since I arrived here by looking for a solution for another problem and I think I can help.

I have also built an extension successfully and then hit that annoying error about a missing libcudart.so.

The solution is to import torch before importing your package, since torch loads libcudart.so into the process.

Cheers,
Ionut
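The import-order advice above can be written as a tiny helper that imports modules strictly one after another. This is a minimal sketch with a made-up function name. It can only help when the missing library is one that importing torch actually makes available; an extension that was linked against `libcudart.so.9.0` would likely still fail if the installed torch build uses a different CUDA runtime version.

```python
import importlib


def import_in_order(*module_names):
    """Import modules one after another and return them in the same order."""
    return [importlib.import_module(name) for name in module_names]


# Intended use for this thread (requires both packages to be installed):
#   torch, pointnet2 = import_in_order("torch", "pointnet2_cuda")
# which is equivalent to writing:
#   import torch
#   import pointnet2_cuda as pointnet2
```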