I’m a developer of a commercial application that requires initializing CUDA in each of dozens of independent processes. These processes are containerized for versioning - I can’t share much more, but using a single runtime to load several models isn’t possible.
However, initializing CUDA uses upwards of 2GB of RAM (not GPU memory). This is testable like so:
import torch
a = torch.tensor([0.0], device="cuda")
and monitoring RAM from any system tool. From my understanding, this is essentially all driver and library code, and thus the C++ runtime wouldn’t be significantly better.
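To get a number that is easier to compare across runtimes than eyeballing a system monitor, the process can report its own resident memory before and after initialization. The following is only a rough sketch under a few assumptions: it is Linux-only (it reads /proc/self/statm), the helper names rss_bytes, rss_delta and report_torch_cuda_overhead are made up for illustration, and the last function needs a CUDA-enabled PyTorch install. It reports the cost of importing torch separately from the cost of creating the CUDA context.

```python
import os


def rss_bytes() -> int:
    """Resident set size of the current process in bytes (Linux only)."""
    # /proc/self/statm lists sizes in pages; the second field is resident pages.
    with open("/proc/self/statm") as f:
        resident_pages = int(f.read().split()[1])
    return resident_pages * os.sysconf("SC_PAGE_SIZE")


def rss_delta(fn) -> int:
    """Run fn() and return how many bytes the resident set grew by."""
    before = rss_bytes()
    fn()
    return rss_bytes() - before


def report_torch_cuda_overhead() -> None:
    """Print host RAM used by importing torch and by initializing CUDA.

    Assumes PyTorch built with CUDA support and a visible NVIDIA GPU.
    """
    def import_torch():
        import torch  # noqa: F401  (first import loads the shared libraries)

    def init_cuda():
        import torch  # already cached by the step above, so this is cheap
        torch.zeros(1, device="cuda")  # first CUDA call creates the context
        torch.cuda.synchronize()

    mib = 2**20
    print(f"import torch: {rss_delta(import_torch) / mib:.0f} MiB")
    print(f"CUDA init:    {rss_delta(init_cuda) / mib:.0f} MiB")
```

Calling report_torch_cuda_overhead() in a fresh interpreter should give figures comparable to what a system tool shows. The same rss_delta helper can wrap the session or engine creation of another runtime for a side-by-side comparison.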
I’ve tried ONNX (onnxruntime-gpu) and TensorRT in Python. They use about 1.5GB and 1.1GB of RAM respectively, which is still too much for my application. As people are deploying models on mobile devices I’m assuming there must be inference engines that are less memory intensive, but I haven’t found any in my searching that are compatible with NVIDIA GPUs.
Anyone run into similar problems or have any advice? How are people typically deploying models?