I’m developing on GCP instances with A100 GPUs, running Ubuntu 18.04. I’ve had no trouble running Python scripts with PyTorch on the GPU. I’ve recreated one of our models in C++ using the libtorch C++ API; it runs successfully on the CPU, but I’ve been unable to get it to run on the GPU.
Running this program:
#include <torch/torch.h>
#include <iostream>

int main() {
  std::cout << torch::cuda::device_count() << std::endl;
}
produces this error:
[W CUDAFunctions.cpp:109] Warning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 2: out of memory (function operator())
0
CUDA initialization fails with an out-of-memory error (function operator()), and no CUDA devices are found.
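To rule out the CUDA runtime itself, the same query can be made without libtorch at all. Below is a minimal standalone sketch (not part of my build; it only assumes cuda_runtime.h and linking against libcudart) that calls cudaGetDeviceCount() directly and prints the raw error:

#include <cuda_runtime.h>
#include <cstdio>

int main() {
  // Query the CUDA runtime directly, bypassing libtorch entirely.
  int count = 0;
  cudaError_t err = cudaGetDeviceCount(&count);
  if (err != cudaSuccess) {
    // Error 2 would match the "out of memory" reported by libtorch.
    std::printf("cudaGetDeviceCount failed: %d (%s)\n",
                static_cast<int>(err), cudaGetErrorString(err));
    return 1;
  }
  std::printf("CUDA devices: %d\n", count);
  return 0;
}

If this standalone check reports the same error 2, the problem lies below libtorch; if it reports the device correctly, the failure is specific to libtorch’s CUDA initialization.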
This out-of-memory report makes little sense: no other process is using the GPU, and the VRAM is completely free:
$ nvidia-smi
Sat May 15 00:16:02 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 455.32.00 Driver Version: 455.32.00 CUDA Version: 11.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 A100-SXM4-40GB On | 00000000:00:04.0 Off | 0 |
| N/A 35C P0 44W / 350W | 0MiB / 40536MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
In addition, only 2 of the 80 GB of system RAM are in use.
$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Mon_Oct_12_20:09:46_PDT_2020
Cuda compilation tools, release 11.1, V11.1.105
Build cuda_11.1.TC455_06.29190527_0
The CUDA installation appears to be fine. The samples execute without issue:
$ /usr/local/cuda/samples/bin/x86_64/linux/release/cppIntegration
GPU Device 0: "Ampere" with compute capability 8.0
Hello World.
Hello World.
$ /usr/local/cuda/samples/bin/x86_64/linux/release/matrixMulCUBLAS
[Matrix Multiply CUBLAS] - Starting...
GPU Device 0: "Ampere" with compute capability 8.0
GPU Device 0: "A100-SXM4-40GB" with compute capability 8.0
MatrixA(640,480), MatrixB(480,320), MatrixC(640,320)
Computing result using CUBLAS...done.
Performance= 8408.76 GFlop/s, Time= 0.023 msec, Size= 196608000 Ops
Computing result using host CPU...done.
Comparing CUBLAS Matrix Multiply with CPU results: PASS
NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
$ /usr/local/cuda/samples/bin/x86_64/linux/release/cudaTensorCoreGemm
Initializing...
GPU Device 0: "Ampere" with compute capability 8.0
M: 4096 (16 x 256)
N: 4096 (16 x 256)
K: 4096 (16 x 256)
Preparing data for GPU...
Required shared memory size: 64 Kb
Computing... using high performance kernel compute_gemm
Time: 1.990656 ms
TFLOPS: 69.04
CUDA also appears to be correctly detected by CMake during the build:
-- Found CUDA: /usr/local/cuda-11.1 (found version "11.1")
-- Caffe2: CUDA detected: 11.1
-- Caffe2: CUDA nvcc is: /usr/local/cuda-11.1/bin/nvcc
-- Caffe2: CUDA toolkit directory: /usr/local/cuda-11.1
-- Caffe2: Header version is: 11.1
-- Found CUDNN: /usr/lib/x86_64-linux-gnu/libcudnn.so
-- Found cuDNN: v8.2.0 (include: /usr/include, library: /usr/lib/x86_64-linux-gnu/libcudnn.so)
CMake Warning at /home/xander/libtorch/share/cmake/Caffe2/public/cuda.cmake:198 (message):
Failed to compute shorthash for libnvrtc.so
Call Stack (most recent call first):
/home/xander/libtorch/share/cmake/Caffe2/Caffe2Config.cmake:88 (include)
/home/xander/dev/vcpkg/scripts/buildsystems/vcpkg.cmake:861 (_find_package)
/home/xander/libtorch/share/cmake/Torch/TorchConfig.cmake:68 (find_package)
/home/xander/dev/vcpkg/scripts/buildsystems/vcpkg.cmake:861 (_find_package)
CMakeLists.txt:30 (find_package)
-- Autodetected CUDA architecture(s): 8.0
-- Added CUDA NVCC flags for: -gencode;arch=compute_80,code=sm_80
-- Found Torch: /home/xander/libtorch/lib/libtorch.so
Note that the warning "Failed to compute shorthash for libnvrtc.so" no longer appears with the nightly build of libtorch, but the CUDA initialization failure remains.
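For more detail on the libtorch side, the repro can be extended with a few extra checks. A sketch (torch::cuda::is_available() and torch::cuda::cudnn_is_available() are existing libtorch calls; the try/catch around the tensor allocation is only there to surface the underlying error text):

#include <torch/torch.h>
#include <iostream>

int main() {
  // Basic CUDA status as seen by libtorch.
  std::cout << "device_count: " << torch::cuda::device_count() << std::endl;
  std::cout << "is_available: " << torch::cuda::is_available() << std::endl;
  std::cout << "cudnn_is_available: " << torch::cuda::cudnn_is_available() << std::endl;
  try {
    // Attempt to allocate a tensor on the GPU to expose the full error message.
    auto t = torch::ones({1}, torch::TensorOptions().device(torch::kCUDA));
    std::cout << "CUDA tensor created: " << t << std::endl;
  } catch (const c10::Error& e) {
    std::cout << "CUDA tensor creation failed: " << e.what() << std::endl;
  }
  return 0;
}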
Restarting the machine has no effect on the issue.
I removed the existing installation with sudo rm -r /usr/local/cuda* and re-installed CUDA, the NVIDIA driver, and cuDNN from NVIDIA’s .deb packages. Again, same result; nothing changed.
On the same machine I am able to run PyTorch models from Python on the GPU without issue. I do NOT have conda’s cudatoolkit installed, so Python is using the same CUDA installation as libtorch. This is confirmed by the fact that when I delete the /usr/ installation of CUDA, the Python code can no longer access the GPU.
My issue appears to be similar to this issue, except that my CUDA samples run fine and I am having trouble only with libtorch, not with PyTorch.
PyTorch was installed with:
pip3 install torch==1.8.1+cu111 torchvision==0.9.1+cu111 torchaudio==0.8.1 -f https://download.pytorch.org/whl/torch_stable.html
libtorch was installed with:
wget https://download.pytorch.org/libtorch/cu111/libtorch-cxx11-abi-shared-with-deps-1.8.1%2Bcu111.zip
unzip libtorch-cxx11-abi-shared-with-deps-1.8.1+cu111.zip
I’ve tried building with GCC 10.3.0 as well as Clang 12.0.0, with the same result from both; CMake is 3.20.1. Between builds I clean with rm -r CMakeFiles CMakeCache.txt.
Does anyone have any additional debugging steps or thoughts on why I am unable to access the GPU specifically in libtorch C++?