I am creating a REST API that runs inference with several models given a list.
The problem is that all the models don't fit on the GPU at the same time.
Is there a way to load a model onto the GPU, run inference with it, move it back to the CPU, and then load the next model onto the GPU for inference, then back to the CPU, and so on?
You could push a single model to the device, perform the forward pass to get the prediction, and push it back to the host using a loop like:
models = [...]  # define a list of all models on the CPU
input = ...     # get your input

for model in models:
    model.to('cuda')                # move the current model to the GPU
    pred = model(input.to('cuda'))  # run the forward pass on the GPU
    model.to('cpu')                 # move the model back to free GPU memory
Thank you for responding to this issue.
I have tried that approach. The problem I have seen is that if I don't delete the pred variable after using it, it keeps memory in use, and when I load another model a CUDA out-of-memory error is thrown!
For inference only, you should wrap the forward pass in a with torch.no_grad() block, which will not store the intermediate tensors that would be needed for the backward call.
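For example, a minimal sketch combining both suggestions (assuming the models and the input tensor start on the CPU; preds is just an illustrative name for collecting the results on the host):

import torch

models = [...]  # list of models, all on the CPU
input = ...     # input tensor on the CPU

preds = []
with torch.no_grad():                    # disable autograd so intermediate activations are not stored
    for model in models:
        model.to('cuda')                 # move the current model to the GPU
        pred = model(input.to('cuda'))   # forward pass on the GPU
        preds.append(pred.to('cpu'))     # keep the result on the host
        model.to('cpu')                  # move the model back to free GPU memory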
Good to know. Do I need to delete the predictions too?
If you've wrapped them in the no_grad block, then it'll be just a tensor on the current device (and would probably not take much memory).
You could delete it anyway, if you want to save this memory, but I don't think it's necessary.
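If you do want to release that memory explicitly, a minimal sketch (assuming pred holds the last remaining reference to the tensor):

del pred                   # drop the last reference to the prediction tensor
torch.cuda.empty_cache()   # optionally return cached blocks to the driver so the memory shows up as free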
Ok, thank you for the info!