Reduce GPU memory blocked by the CUDA context


I’m currently working on a single-GPU system with limited GPU memory where multiple PyTorch models are offered as “services” that run in separate Python processes.
Ideally, I would like to be able to free the GPU memory for each model on demand without killing its Python process. However, if I only delete the model (and empty the cache) without killing the process, I’m still stuck with the memory usage of the CUDA context, which is about 1.2 GB per process in my case.
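For reference, the per-service cleanup I’m describing looks roughly like this (a sketch; the `_STATE` dict and `unload_model` name are hypothetical stand-ins for however a service holds its model):

```python
import gc

# Hypothetical service-side state: the only strong reference to the model.
_STATE = {"model": None}

def unload_model(state=_STATE):
    """Drop the model reference and flush PyTorch's caching allocator.
    This frees the model's weights, but the CUDA context stays resident."""
    state["model"] = None  # drop the reference to the nn.Module
    gc.collect()           # collect any cycles still pinning tensors
    try:
        import torch  # lazy import keeps the sketch usable without torch
        if torch.cuda.is_available():
            torch.cuda.empty_cache()  # return cached blocks to the driver
    except ImportError:
        pass
    return state
```

After this, `nvidia-smi` still shows the process holding the context’s ~1.2 GB.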

In general, the memory usage of the context seems extremely high to me, especially considering that the model itself only adds about 200 MB on top of it while loaded.
So my question is: are there any options to decrease the memory footprint of the CUDA context, or to get rid of it altogether, when running inference on the models?

I’d be grateful for advice!

You could reduce the CUDA context size by removing kernels, e.g. by dropping libraries such as MAGMA or cuDNN from the build, if these are not used.
Besides that, I’m not aware of a way to reduce the operator set in the framework itself.
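If cuDNN is dynamically linked, turning it off before the first CUDA call might at least keep its kernels from being loaded (a best-effort sketch; whether this actually helps depends on how your build links cuDNN, and statically linked pip wheels load everything up front anyway):

```python
def disable_cudnn():
    """Best-effort: switch cuDNN off before any CUDA work happens.
    Returns True if the flag was set, False if torch is unavailable."""
    try:
        import torch  # lazy import so the sketch loads without torch
    except ImportError:
        return False
    # Must run before the first convolution/RNN call to matter.
    torch.backends.cudnn.enabled = False
    return not torch.backends.cudnn.enabled
```

Note this is a runtime switch, not a rebuild: PyTorch simply falls back to its native kernels for ops it would otherwise dispatch to cuDNN.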

Thanks for the reply! So do I understand correctly that this memory will always be occupied one way or another if I want to use CUDA functionality, no matter whether it’s via PyTorch or if I’d e.g. export the models to ONNX and use them from C++?

The memory will be occupied once the CUDA kernels are loaded.
Depending on the build you are using, all libraries might be loaded directly when the context is created (e.g. in the pip wheels, where the libs are statically linked) or on first use (e.g. cuDNN should behave like this when dynamically linked).
Regardless of this behavior, once you are using “all libraries”, the context size will end up the same.
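To put numbers on this, one way to estimate the context’s share is to compare what the driver reports as used against what PyTorch’s caching allocator accounts for (a sketch; the helper name is my own, and the estimate also includes memory held by other processes on the device):

```python
def estimate_context_bytes(total, free, reserved):
    """Driver-reported usage minus what PyTorch's allocator tracks:
    roughly the CUDA context plus loaded kernels (plus any memory
    used by other processes sharing the device)."""
    return (total - free) - reserved

try:
    import torch  # lazy import so the helper works without torch
    if torch.cuda.is_available():
        torch.cuda.init()  # force context creation / kernel loading
        free, total = torch.cuda.mem_get_info()
        overhead = estimate_context_bytes(
            total, free, torch.cuda.memory_reserved()
        )
        print(f"context + kernels: ~{overhead / 2**20:.0f} MiB")
except ImportError:
    pass
```

Running this in a fresh process before and after the first kernel launch should show the difference between a statically linked build (everything loaded at context creation) and a dynamically linked one (kernels loaded on first use).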