Performance issue in the first CUDA-driver API call when linking against libtorch

We are running into an issue where, in a project that links against PyTorch and also uses our own (unrelated) CUDA kernels and CUDA driver API calls, the first call to the CUDA driver API is very slow (several seconds) for each device. This only happens when linking against libtorch, even if no calls into libtorch are made. If we don’t link against libtorch (or for every subsequent call), there is no such penalty and the first call is 10x - 20x faster.
(My general setup is CUDA 11.3, PyTorch 1.10.1, Linux.)

A fairly minimal example is

#include <chrono>
#include <iostream>
#include <cuda.h>
#include <cuda_runtime.h>
#include <cuda_runtime_api.h>

int main() {
  auto t0 = std::chrono::steady_clock::now();
  int devCount;
  cudaGetDeviceCount(&devCount);
  if (devCount < 1) return 1;
  int deviceId = 0;
  cudaSetDevice(deviceId);
  size_t freeMem, totalMem;
  // First runtime API call that needs an active context on the device;
  // this is where the one-time initialisation cost shows up.
  cudaError_t err = cudaMemGetInfo(&freeMem, &totalMem);
  if (err != cudaSuccess) return 1;
  auto t1 = std::chrono::steady_clock::now();
  std::chrono::duration<double, std::milli> ms0 = t1 - t0;
  std::cout << ms0.count() << " ms" << std::endl;
}

and

cmake_minimum_required(VERSION 3.18.0)
project(example CXX)
set(CMAKE_CXX_STANDARD 17)
set(CMAKE_CXX_STANDARD_REQUIRED true)

find_package(Torch PATHS $ENV{LIBTORCH_PATH} REQUIRED)
message(STATUS "Torch include directories: ${TORCH_INCLUDE_DIRS}")
message(STATUS "Torch libraries: ${TORCH_LIBRARIES}")

enable_language(CUDA)
set(CMAKE_CUDA_STANDARD 14)
set(CMAKE_CUDA_STANDARD_REQUIRED true)

find_library(CUDART_LIBRARY cudart ${CMAKE_CUDA_IMPLICIT_LINK_DIRECTORIES})

add_executable(minimal_torch_example main.cu)

target_link_libraries(minimal_torch_example PRIVATE ${TORCH_LIBRARIES})

This yields an executable that takes 2.2 s to run on my machine; if I don’t link against libtorch (i.e., comment out the last line of the CMakeLists.txt), it takes <0.1 s.
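To confirm that the cost is paid only on the first call, a variant of the example can time the first and a subsequent call separately. This is just a sketch of one way to do that, timing only cudaMemGetInfo:

#include <chrono>
#include <iostream>
#include <cuda_runtime.h>

// Returns the wall-clock duration of a single cudaMemGetInfo call in ms.
static double timedMemGetInfo() {
  size_t freeMem, totalMem;
  auto t0 = std::chrono::steady_clock::now();
  cudaMemGetInfo(&freeMem, &totalMem);
  auto t1 = std::chrono::steady_clock::now();
  return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

int main() {
  cudaSetDevice(0);
  // The first call pays the context-creation / module-loading cost once;
  // subsequent calls on the same device should be fast.
  std::cout << "first call:      " << timedMemGetInfo() << " ms" << std::endl;
  std::cout << "subsequent call: " << timedMemGetInfo() << " ms" << std::endl;
}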

Running this very simple executable under nv-nsight-cu shows that ~60000 calls to cuModuleGetFunction are made, some of which (~30) fail with CUDA_ERROR_NOT_FOUND (500). I do not know whether any of this is unexpected or problematic; the program itself finishes without errors.

Do you have any insight into whether this initialisation cost is expected? Is there any way we can avoid it? Should we care about the failing calls to cuModuleGetFunction?

cheers!


I think the “slowdown” is expected, as you are creating the CUDA context, which loads all kernels for the used compute capability. You could additionally check the memory usage via nvidia-smi (add a sleep to your code if needed); in my setup I see ~255MB (without libtorch) vs. ~697MB (with libtorch).
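In case it helps with reproducing this: a minimal sketch of the “add a sleep” suggestion, so the process stays alive long enough to read its device memory usage off nvidia-smi in another terminal (the 30-second duration is arbitrary):

#include <chrono>
#include <thread>
#include <cuda_runtime.h>

int main() {
  size_t freeMem, totalMem;
  // Force context creation via the first CUDA call, then keep the
  // process alive so nvidia-smi can report its device memory usage.
  cudaMemGetInfo(&freeMem, &totalMem);
  std::this_thread::sleep_for(std::chrono::seconds(30));
}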

Yes, for me the difference in memory usage is even larger: ~90MB vs. ~1390MB.

I guess the best we can do, then, is to compile libtorch ourselves with as few features enabled as possible.

Either way, I’m still quite surprised by the cuModuleGetFunction calls that do not succeed.

cheers!