REST API with several models that use the GPU, but not at the same time

I am creating a REST API that runs inference with several models given in a list.

The problem is that the models don’t all fit on the GPU at the same time.

Is there a way to load a model onto the GPU, run inference with it, move it back to the CPU, and then load the next model onto the GPU for inference, and so on?

You could push a single model to the device, perform the forward pass to get the prediction, and push it back to the host using a loop like:

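models = [...] # define a list of all models on the CPU

input = ... # get your input and move it to the GPU, e.g. input.to('cuda')
for model in models:
    model.to('cuda') # push the current model to the GPU
    pred = model(input) # forward pass on the GPU
    model.to('cpu') # move it back to the host to free GPU memory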

Thank you for looking into this issue :sweat_smile:

I have tried that approach. The problem I have seen is that if I don’t delete the pred variable after using it, its memory stays in use, and when I load the next model a CUDA out-of-memory error is thrown!

For inference only, you should wrap the forward pass in a with torch.no_grad() block, so that the intermediate tensors, which would only be needed for a backward call, are not stored.
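As a minimal sketch of this suggestion, the loop could look roughly like the following. The helper name run_models and the small nn.Linear models are assumptions made up for illustration, not part of the original setup, and the code falls back to the CPU when no GPU is available.

```python
import torch
import torch.nn as nn


def run_models(models, batch, device=None):
    """Run each model in turn, keeping only one model on `device` at a time."""
    if device is None:
        # Assumption: use the first CUDA device if present, else the CPU.
        device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    batch = batch.to(device)
    preds = []
    for model in models:
        model.eval()               # inference mode for dropout/batchnorm layers
        model.to(device)           # push the current model to the device
        with torch.no_grad():      # do not store tensors for a backward pass
            pred = model(batch)
        preds.append(pred.cpu())   # keep only a host copy of the prediction
        model.to("cpu")            # move the model back to the host
    return preds


# Hypothetical stand-ins for the real models, all created on the CPU.
models = [nn.Linear(4, 2) for _ in range(3)]
preds = run_models(models, torch.randn(8, 4))
```

Copying each prediction to the CPU before moving on means that no device tensor from a previous model is still referenced when the next model is loaded.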

Good to know. Do I need to delete the predictions too?

If you’ve computed them inside the no_grad block, each prediction will be just a tensor on the current device (and would probably not take much memory).
You could delete it anyway if you want to save this memory, but I don’t think it’s necessary.
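If you do want to release a prediction explicitly, a sketch could look like this. The random tensor is only a stand-in for a real model output, and the call to torch.cuda.empty_cache() is optional: PyTorch reuses its cached memory by itself, so that call mainly matters when other processes need the GPU memory.

```python
import torch

# Assumption: use the GPU if one is available, otherwise the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
pred = torch.randn(8, 2, device=device)  # stand-in for a model output

result = pred.cpu()  # keep a host copy (returns the same tensor if already on the CPU)
del pred             # drop the reference to the device tensor

if torch.cuda.is_available():
    torch.cuda.empty_cache()              # hand cached, unused blocks back to the driver
    print(torch.cuda.memory_allocated())  # bytes still held by live tensors
```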

Ok, thank you for the info!