Controlling GPU memory consumption during inference

Hello,

I have a laptop with:
OS Name - Microsoft Windows 10 Enterprise 2016 LTSB
Processor - Intel® Core™ i7-6700HQ CPU @ 2.60 GHz, 2592 MHz, 4 Core(s), 8 Logical Processor(s)
Installed Physical Memory (RAM) - 16.0 GB
Total Physical Memory - 15.8 GB
Available Physical Memory - 8.92 GB
Total Virtual Memory - 18.3 GB
Available Virtual Memory - 8.75 GB
Page File Space - 2.53 GB

Adapter Description - Intel® HD Graphics 530
Adapter Description - NVIDIA Quadro M2000M

The NVIDIA GPU has only 4 GB of RAM.

I have a model that runs inference successfully through the PyTorch C++ interface using torch::jit::script::Module.

When the application runs as a standalone process everything works fine, but when I run additional CUDA-based applications that also consume part of the GPU memory, the following operation fails:

std::shared_ptr<torch::jit::script::Module> module; // loaded earlier via torch::jit::load(...)
at::Tensor output = module->forward(inputs).toTensor();
CUDA(cudaDeviceSynchronize()); // CUDA(...) is our error-checking macro

I assume the root cause of the failure is insufficient available GPU memory.

Questions:

  • Is there any way to handle this failure, which depends on runtime conditions?

  • Is there a way to know statically, based on the model, how much GPU memory the inference operation will require?

  • Is there a way to tell PyTorch to use CPU memory when no more GPU memory is available, as the TensorFlow C++ interface provides?
    For example:

tensorflow::SessionOptions sessionOptions;
sessionOptions.config.mutable_gpu_options()->set_allow_growth(true);

Or:

sessionOptions.config.mutable_gpu_options()->set_per_process_gpu_memory_fraction(0.8);

Thanks,

You could wrap your code in a try/catch block and push all operations to the CPU in case an out-of-memory error is raised.
Besides that I’m not aware of any API calls to limit the GPU memory.
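
A minimal sketch of that fallback in C++, assuming the module was loaded via torch::jit::load and that libtorch reports the CUDA out-of-memory condition as a c10::Error (the helper name and structure here are illustrative, not an official API):

#include <torch/script.h>
#include <vector>

// Hypothetical helper: try the forward pass on the GPU and fall back to the CPU
// if a c10::Error (e.g. an out-of-memory error) is thrown.
at::Tensor forward_with_cpu_fallback(
    const std::shared_ptr<torch::jit::script::Module>& module,
    const at::Tensor& input) {
  try {
    module->to(torch::kCUDA);
    std::vector<torch::jit::IValue> inputs{input.to(torch::kCUDA)};
    return module->forward(inputs).toTensor();
  } catch (const c10::Error&) {
    // The GPU attempt failed; move the model and the input to the CPU and retry.
    module->to(torch::kCPU);
    std::vector<torch::jit::IValue> inputs{input.to(torch::kCPU)};
    return module->forward(inputs).toTensor();
  }
}

Depending on your libtorch version you may also be able to release cached blocks before retrying (e.g. via c10::cuda::CUDACachingAllocator::emptyCache()), since the caching allocator keeps freed memory reserved for the process.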

Thank you.
I have two questions:

  1. Is there any query provided by the PyTorch C++ API to check ahead of time how much device memory will be consumed by the model nodes and their weights?
    That way I would be able to decide in advance on which device to run inference.

  2. Is there a PyTorch C++ API that enables splitting the model nodes between multiple devices? For example, splitting the model between the CPU & GPU…
    I read here:
    PyTorch C++ Opcodes
    that this interface provides the ability to use multiple GPUs.
    So I thought that maybe part of the model could be processed by the CPU while another part is processed by the GPU.

  1. No, since memory fragmentation and, in particular, the different backends might not yield deterministic results; the usage can also differ based on the device, the library versions, and the available memory.
    E.g. if you are using torch.backends.cudnn.benchmark = True, cudnn will profile different kernels for your current workload and will select the fastest one that fits in the available workspace. This also means that different algorithms (and thus memory footprints) might be used if your device is almost full. The best approach would be to use the deterministic mode and run a single iteration in order to estimate the memory usage (see the measurement sketch after the code below).

  2. Since the to() operation is differentiable, you can easily apply model sharding via:

import torch
import torch.nn as nn


class MyModel(nn.Module):
    def __init__(self):
        super(MyModel, self).__init__()
        self.module0 = nn.Linear(1, 1).to('cuda:0')
        self.module1 = nn.Linear(1, 1).to('cuda:1')
        self.module2 = nn.Linear(1, 1)  # to('cpu') not necessary here, CPU is the default

    def forward(self, x):
        # Move the activation to whichever device the next submodule lives on.
        x = x.to('cuda:0')
        x = self.module0(x)
        x = x.to('cuda:1')
        x = self.module1(x)
        x = x.to('cpu')
        x = self.module2(x)
        return x
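
Regarding the estimate in point 1, here is a minimal C++ sketch of the empirical approach, under the assumption that the module and input already live on the GPU: it simply compares the free device memory reported by cudaMemGetInfo() before and after one forward pass, which only gives a rough figure because the caching allocator keeps freed blocks reserved.

#include <torch/script.h>
#include <cuda_runtime.h>
#include <cstdio>
#include <vector>

// Rough, empirical estimate of the GPU memory consumed by one forward pass.
void estimate_forward_memory(const std::shared_ptr<torch::jit::script::Module>& module,
                             const at::Tensor& input) {
  size_t free_before = 0, free_after = 0, total = 0;

  cudaDeviceSynchronize();
  cudaMemGetInfo(&free_before, &total);

  std::vector<torch::jit::IValue> inputs{input};
  at::Tensor output = module->forward(inputs).toTensor();

  cudaDeviceSynchronize();
  cudaMemGetInfo(&free_after, &total);

  // `output` is intentionally kept alive until after the second measurement.
  (void)output;

  const size_t used = free_before > free_after ? free_before - free_after : 0;
  std::printf("Approx. GPU memory used by forward(): %zu MB out of %zu MB total\n",
              used / (1024 * 1024), total / (1024 * 1024));
}

As noted above, if cudnn benchmarking is enabled the selected algorithms (and therefore the measured footprint) can change between runs, so disabling it for the measurement should make the estimate more stable.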