I have a model which I successfully run inference on through the PyTorch C++ interface, using torch::jit::script::Module.
When the application runs as a standalone process everything works fine, but when I add additional CUDA-based applications that also consume part of the GPU memory, the following operation fails:
I assume the root cause of the failure is insufficient GPU memory.
Questions:
Is there any way to handle this failure at runtime?
Is there a way to know statically, based on the model, how much GPU memory the inference operation will require?
Is there a way to tell PyTorch to fall back to CPU memory when no more GPU memory is available, as the TensorFlow C++ interface provides?
For example:
You could wrap your code in a try..except block and push all operations to the CPU in case an OOM error is raised.
Besides that I’m not aware of any API calls to limit the GPU memory.
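The try/except fallback above can be sketched as follows. `run_with_cpu_fallback` is a hypothetical helper, not a PyTorch API; catching `RuntimeError` and inspecting the message is the version-portable way to detect a CUDA OOM (newer releases also expose `torch.cuda.OutOfMemoryError`):

```python
import torch
import torch.nn as nn

def run_with_cpu_fallback(module, inp, device='cuda'):
    # Hypothetical helper: try the requested device first and fall back
    # to the CPU when CUDA reports an out-of-memory error.
    try:
        return module.to(device)(inp.to(device)).cpu()
    except RuntimeError as e:
        if 'out of memory' not in str(e):
            raise  # not an OOM error, re-raise it
        torch.cuda.empty_cache()  # release cached blocks before retrying
        return module.to('cpu')(inp.cpu())

# CPU-only demo so the sketch runs anywhere:
out = run_with_cpu_fallback(nn.Linear(2, 2), torch.randn(1, 2), device='cpu')
print(out.shape)  # torch.Size([1, 2])
```

Note that after a real OOM the module may already be partially moved, so moving it back with `.to('cpu')` before retrying keeps the state consistent.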
Is there any query in the PyTorch C++ API to check ahead of time how much device memory the model's nodes and their weights will consume?
That way I could decide ahead of time which device to run inference on.
Is there a PyTorch C++ API which enables splitting the model nodes between multiple devices? For example, splitting the model between the CPU & GPU…
I read here: PyTorch C++ Opcodes
that this interface provides the ability to use multiple GPUs.
So I thought that maybe part of the model could be processed by the CPU while another part is processed by the GPU.
No, since memory fragmentation and, in particular, backend-specific behavior might not yield deterministic results, and the footprint might also differ based on the device, the library versions, as well as the available memory.
E.g. if you are using torch.backends.cudnn.benchmark = True, cudnn will profile different kernels for your current workload and select the fastest one that fits in the available workspace. This also means that different algorithms (and thus memory footprints) might be used if your device is almost full. The best way would be to use the deterministic mode and run a single inference iteration in order to estimate the memory usage.
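That measurement can be sketched as below. `measure_peak_gpu_memory` is a hypothetical helper built on `torch.cuda.reset_peak_memory_stats` / `torch.cuda.max_memory_allocated`; it returns None when no CUDA device is available:

```python
import torch
import torch.nn as nn

def measure_peak_gpu_memory(model, example_input, device='cuda'):
    """Run one forward pass in deterministic mode and report the peak
    GPU memory it needed, in bytes (hypothetical helper)."""
    if not torch.cuda.is_available():
        return None
    # Disable the cudnn benchmark mode so the measured footprint is stable.
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    model = model.to(device).eval()
    torch.cuda.reset_peak_memory_stats(device)
    with torch.no_grad():
        model(example_input.to(device))
    return torch.cuda.max_memory_allocated(device)

peak = measure_peak_gpu_memory(nn.Linear(2, 2), torch.randn(1, 2))
print(peak)  # bytes on a CUDA machine, None otherwise
```

Note this excludes the caching allocator's reserved-but-unused blocks; `torch.cuda.max_memory_reserved` gives the larger figure the driver actually sees.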
Since the to() operation is differentiable you can easily apply model sharding via:
import torch
import torch.nn as nn

class MyModel(nn.Module):
    def __init__(self):
        super(MyModel, self).__init__()
        self.module0 = nn.Linear(1, 1).to('cuda:0')
        self.module1 = nn.Linear(1, 1).to('cuda:1')
        self.module2 = nn.Linear(1, 1)  # to('cpu') not necessary here

    def forward(self, x):
        x = x.to('cuda:0')
        x = self.module0(x)
        x = x.to('cuda:1')
        x = self.module1(x)
        x = x.to('cpu')
        x = self.module2(x)
        return x
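The same sharding idea works for the CPU/GPU split you asked about. A minimal sketch, assuming a single GPU (`SplitModel` is illustrative; it degrades to CPU-only when no CUDA device is available so it runs anywhere):

```python
import torch
import torch.nn as nn

dev = 'cuda:0' if torch.cuda.is_available() else 'cpu'

class SplitModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.gpu_part = nn.Linear(1, 1).to(dev)  # heavy part on the GPU
        self.cpu_part = nn.Linear(1, 1)          # remainder stays on the CPU

    def forward(self, x):
        x = self.gpu_part(x.to(dev))
        return self.cpu_part(x.to('cpu'))

out = SplitModel()(torch.randn(4, 1))
print(out.shape)  # torch.Size([4, 1])
```

The device-to-device `to()` calls in forward() are where the activations cross the boundary, so keep the split at a point where the transferred tensors are small.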