I am creating a REST API that runs inference with several models given a list.
The problem is that all the models don't fit on the GPU at the same time.
Is there a way to load a model onto the GPU, run inference with it, move it back to the CPU, and then load the next model onto the GPU for inference, then back to the CPU, and so on?
You could push a single model to the device, perform the forward pass to get the prediction, and push it back to the host using a loop like:
models = [...]  # define a list of all models on the CPU
input = ...     # get your input

for model in models:
    model.to('cuda')                # move the current model to the GPU
    pred = model(input.to('cuda'))  # run the forward pass on the GPU
    model.to('cpu')                 # move the model back to free GPU memory
Thank you for responding to this issue.
I have tried that approach. The problem I have seen is that if I don't delete the pred variable after using it, it keeps memory in use, and when I load another model a CUDA out-of-memory error is thrown!
For inference only, you should wrap the forward pass in a with torch.no_grad() block, which will not store the intermediate tensors that would be needed for the backward call.
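For example, a minimal sketch combining both suggestions (assuming the models and the input tensor start on the CPU; preds is just an illustrative name for collecting the results on the host):

import torch

models = [...]  # list of models, all on the CPU
input = ...     # input tensor on the CPU

preds = []
with torch.no_grad():                    # disable autograd so intermediate activations are not stored
    for model in models:
        model.to('cuda')                 # move the current model to the GPU
        pred = model(input.to('cuda'))   # forward pass on the GPU
        preds.append(pred.to('cpu'))     # keep the result on the host
        model.to('cpu')                  # move the model back to free GPU memory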
Good to know. Do I need to delete the predictions too?
If you've wrapped them in the no_grad block, then it'll be just a tensor on the current device (and would probably not take much memory).
You could delete it anyway, if you want to save this memory, but I don't think it's necessary.
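If you do want to release that memory explicitly, a minimal sketch (assuming pred holds the last remaining reference to the tensor):

del pred                   # drop the last reference to the prediction tensor
torch.cuda.empty_cache()   # optionally return cached blocks to the driver so the memory shows up as free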
Ok, thank you for the info!