Libtorch CUDA initialization: Unexpected error out of memory

I’m developing on GCP instances with A100 GPUs, running Ubuntu 18.04. I’ve had no trouble running Python scripts with PyTorch on the GPU. I’ve recreated one of our models in C++ using the libtorch C++ interface. It runs successfully on the CPU, but I’ve been unable to get it to run on the GPU.

Running this program:

#include <torch/torch.h>
#include <iostream>

int main() {
    std::cout << torch::cuda::device_count() << std::endl;
}

produces this error:

[W CUDAFunctions.cpp:109] Warning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 2: out of memory (function operator())
0

As you can see, CUDA initialization fails with out of memory (function operator()) and the reported device count is 0.

The error is misleading: no other process is using the GPU, and nvidia-smi shows the VRAM is completely free:

$ nvidia-smi
Sat May 15 00:16:02 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 455.32.00    Driver Version: 455.32.00    CUDA Version: 11.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  A100-SXM4-40GB      On   | 00000000:00:04.0 Off |                    0 |
| N/A   35C    P0    44W / 350W |      0MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

In addition, only 2 GB of the 80 GB of system RAM is in use.

$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Mon_Oct_12_20:09:46_PDT_2020
Cuda compilation tools, release 11.1, V11.1.105
Build cuda_11.1.TC455_06.29190527_0

The CUDA installation appears to be fine. The samples execute without issue:

$ /usr/local/cuda/samples/bin/x86_64/linux/release/cppIntegration
GPU Device 0: "Ampere" with compute capability 8.0

Hello World.
Hello World.
$ /usr/local/cuda/samples/bin/x86_64/linux/release/matrixMulCUBLAS
[Matrix Multiply CUBLAS] - Starting...
GPU Device 0: "Ampere" with compute capability 8.0

GPU Device 0: "A100-SXM4-40GB" with compute capability 8.0

MatrixA(640,480), MatrixB(480,320), MatrixC(640,320)
Computing result using CUBLAS...done.
Performance= 8408.76 GFlop/s, Time= 0.023 msec, Size= 196608000 Ops
Computing result using host CPU...done.
Comparing CUBLAS Matrix Multiply with CPU results: PASS

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
$ /usr/local/cuda/samples/bin/x86_64/linux/release/cudaTensorCoreGemm
Initializing...
GPU Device 0: "Ampere" with compute capability 8.0

M: 4096 (16 x 256)
N: 4096 (16 x 256)
K: 4096 (16 x 256)
Preparing data for GPU...
Required shared memory size: 64 Kb
Computing... using high performance kernel compute_gemm
Time: 1.990656 ms
TFLOPS: 69.04

CUDA also appears to be correctly identified by CMake during the build:

-- Found CUDA: /usr/local/cuda-11.1 (found version "11.1")
-- Caffe2: CUDA detected: 11.1
-- Caffe2: CUDA nvcc is: /usr/local/cuda-11.1/bin/nvcc
-- Caffe2: CUDA toolkit directory: /usr/local/cuda-11.1
-- Caffe2: Header version is: 11.1
-- Found CUDNN: /usr/lib/x86_64-linux-gnu/libcudnn.so
-- Found cuDNN: v8.2.0  (include: /usr/include, library: /usr/lib/x86_64-linux-gnu/libcudnn.so)
CMake Warning at /home/xander/libtorch/share/cmake/Caffe2/public/cuda.cmake:198 (message):
  Failed to compute shorthash for libnvrtc.so
Call Stack (most recent call first):
  /home/xander/libtorch/share/cmake/Caffe2/Caffe2Config.cmake:88 (include)
  /home/xander/dev/vcpkg/scripts/buildsystems/vcpkg.cmake:861 (_find_package)
  /home/xander/libtorch/share/cmake/Torch/TorchConfig.cmake:68 (find_package)
  /home/xander/dev/vcpkg/scripts/buildsystems/vcpkg.cmake:861 (_find_package)
  CMakeLists.txt:30 (find_package)


-- Autodetected CUDA architecture(s):  8.0
-- Added CUDA NVCC flags for: -gencode;arch=compute_80,code=sm_80
-- Found Torch: /home/xander/libtorch/lib/libtorch.so

Note that the warning Failed to compute shorthash for libnvrtc.so no longer appears on the nightly build of libtorch, but the CUDA initialization failure remains.

Restarting the machine has no effect on the issue.

I ran sudo rm -r /usr/local/cuda* and reinstalled CUDA, the NVIDIA driver, and cuDNN from NVIDIA’s .deb packages. The result was the same.

On the same machine I am able to run PyTorch models from Python on the GPU without issue. I do NOT have conda’s cudatoolkit installed, so it’s using the same CUDA installation as libtorch. I confirmed this: when I delete the /usr/ installation of CUDA, the Python code can no longer access the GPU.

My issue appears to be similar to this issue, except that my CUDA samples run fine and I am having trouble only in libtorch, not in PyTorch.

pytorch was installed with:

pip3 install torch==1.8.1+cu111 torchvision==0.9.1+cu111 torchaudio==0.8.1 -f https://download.pytorch.org/whl/torch_stable.html

libtorch was installed with:

wget https://download.pytorch.org/libtorch/cu111/libtorch-cxx11-abi-shared-with-deps-1.8.1%2Bcu111.zip
unzip libtorch-cxx11-abi-shared-with-deps-1.8.1+cu111.zip

I’ve tried building with GCC 10.3.0 as well as clang 12.0.0, with the same result from both. The CMake version is 3.20.1. Between builds, I clean with rm -r CMakeFiles CMakeCache.txt.

Does anyone have any additional debugging steps or thoughts on why I am unable to access the GPU specifically in libtorch C++?

Lord have mercy. It was this in my CMakeLists.txt:

set(CMAKE_CXX_FLAGS "-fsanitize=undefined -fsanitize=address")

leftover from some debugging.

Sanitizers breaking CUDA isn’t too surprising.

libtorch runs fine on GPU in the absence of these compiler flags.