Sequential inference of multiple models in a memory-limited environment

Dear fellow PyTorch users, I have a question concerning deployment in memory-constrained environments.

I wish to deploy a large number of models (more than 20) on the same data. For example, one model would detect cars in the image, a separate network would segment the streets, and another would apply domain transfer to improve the performance of the other networks when the scene has unusual characteristics (rain, low lighting, etc.).

Though it might be better to have a single model perform all of these tasks, I have found that using separate models gives much better performance. Also, our clients (and I myself) value the modularity that separate models provide.

Currently, TensorFlow in graph mode provides automatic memory optimization for both GPU memory and CPU RAM. Even with 20 models of 1 GB of parameters each, the allocated GPU memory is not 20 GB, because TensorFlow automatically loads only the models that are needed from disk into RAM and GPU memory. This saves both GPU memory and RAM, both of which are valuable resources in deployment.

Is there any way to do something similar in PyTorch? I would like to run multiple, potentially very heavy, models in sequence on the same data without exhausting my GPU memory and RAM. I have considered manually loading each model and sending it to the GPU in sequence. However, this would be inefficient if disk IO or CPU/GPU transfer became a bottleneck. Also, TorchScript seems to discourage moving a model to a different device after scripting is complete.
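For reference, the manual approach I am considering would look roughly like the sketch below. The checkpoint file names are placeholders, and I am assuming each model has already been saved as a TorchScript module with torch.jit.save():

```python
import torch

# Placeholder file names; each model is assumed to be a saved TorchScript module.
checkpoint_paths = ["detector.pt", "segmenter.pt", "domain_transfer.pt"]

@torch.no_grad()
def run_sequentially(batch, paths=checkpoint_paths):
    batch = batch.cuda()
    results = []
    for path in paths:
        # Load the scripted model straight onto the GPU, so only one
        # set of weights is resident at a time.
        model = torch.jit.load(path, map_location="cuda")
        model.eval()
        results.append(model(batch).cpu())
        del model                    # drop the only reference to the weights
        torch.cuda.empty_cache()     # return the freed blocks to the driver
    return results
```

My concern with this is that the torch.jit.load() call in every iteration puts disk IO and host-to-device transfer on the critical path.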

I have considered converting to ONNX. However, ONNX export supports a relatively limited set of PyTorch operators, and implementation differences may lead to different outputs. Is there a PyTorch-native solution to this problem?

Many thanks in advance for any replies.


Good day vertias,

without knowing the details of your exact memory footprint, architectures, and inference payload, it is not easy to give detailed advice, but:

  • Have you taken a look at the Mobile Interpreter? It may reduce the memory footprint by up to 75% (see the export sketch after this list).
  • Consider spending more time on producing an end-to-end architecture, since this would solve all your problems in one go. A good segmentation architecture should in general be capable of segmenting streets AND cars.
  • Reduce the size of your models by applying knowledge distillation.
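
As a minimal sketch of the Mobile Interpreter route (the model here is just a stand-in for one of your networks):

```python
import torch
from torch.utils.mobile_optimizer import optimize_for_mobile

# Stand-in for one of your networks.
model = torch.nn.Sequential(torch.nn.Conv2d(3, 16, 3), torch.nn.ReLU())
model.eval()

scripted = torch.jit.script(model)                  # convert to TorchScript
optimized = optimize_for_mobile(scripted)           # mobile-oriented graph optimizations
optimized._save_for_lite_interpreter("model.ptl")   # save in the lite-interpreter format

# Loading with the lighter runtime:
# lite_model = torch.jit.mobile._load_for_lite_interpreter("model.ptl")
```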

Cheers.

@Chris_77 Thank you for your reply.

As you have mentioned, using smaller models would be one possible solution.

However, the problem I face is applying multiple models to the same data in sequence. This setting is not as uncommon as one might imagine.

In this circumstance, I think it is inefficient to use scarce memory to hold models that are not needed immediately.
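
For completeness, the alternative I can think of is keeping every model instantiated in CPU RAM and shuttling only the active one onto the GPU. A rough sketch (here `models` stands for my already-constructed modules):

```python
import torch

@torch.no_grad()
def run_pipeline(models, batch):
    # 'models' is a list of nn.Modules that live in CPU RAM between calls.
    batch = batch.cuda()
    outputs = []
    for model in models:
        model.to("cuda")            # copy the weights to the GPU only when needed
        model.eval()
        outputs.append(model(batch).cpu())
        model.to("cpu")             # park the weights back in host RAM
        torch.cuda.empty_cache()    # release the cached GPU blocks
    return outputs
```

This keeps GPU memory bounded by the largest single model, but it still holds all 20+ models in CPU RAM at once, which is exactly the cost I would like to avoid.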