Dear fellow PyTorch users, I have a question concerning deployment in memory-constrained environments.
I wish to deploy a large number (more than 20) of models on the same data. For example, one model would detect cars in the image, a separate network would segment the streets, and another would apply domain transfer to improve the performance of the other networks when the scene has unusual characteristics (rain, low lighting, etc.).
Though a single multi-task model might seem preferable, I have found that using separate models gives much better accuracy. Also, our clients (and I myself) value the modularity that separate models provide.
Currently, TensorFlow in graph mode provides automatic memory optimizations for both GPU memory and CPU RAM. Even with 20 models of 1 GB of parameters each, the allocated GPU memory is not 20 GB, because TensorFlow automatically loads only the necessary models from disk into RAM and GPU memory. This saves both GPU memory and RAM, which are valuable resources in deployment.
Is there any way to do something similar in PyTorch? I would like to run multiple, potentially very heavy, models in sequence on the same data without exhausting my GPU memory and RAM. I have considered manually loading each model and moving it to the GPU in sequence. However, this would be inefficient if disk IO or CPU-to-GPU transfer became a bottleneck. Also, moving a model between devices after scripting is complete seems to be discouraged for TorchScript models.
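For reference, the manual approach I am considering looks roughly like the sketch below: keep at most one model resident on the GPU at a time, loading each checkpoint from disk, running it, then freeing its memory before the next. The model factories, checkpoint paths, and function name are placeholders I made up for illustration, not an established API.

```python
import torch

def run_models_sequentially(checkpoint_paths, model_factories, batch, device="cuda"):
    """Load each model from disk, run it on the batch, then free its memory.

    checkpoint_paths: list of paths to saved state dicts (hypothetical).
    model_factories:  list of callables that build the matching architectures.
    """
    outputs = []
    for path, make_model in zip(checkpoint_paths, model_factories):
        model = make_model()
        # map_location="cpu" avoids a transient extra GPU copy during torch.load
        model.load_state_dict(torch.load(path, map_location="cpu"))
        model.to(device).eval()
        with torch.no_grad():
            outputs.append(model(batch.to(device)).cpu())
        # Drop the only reference so the parameters can be freed
        del model
        if device == "cuda":
            # Return cached blocks to the driver so other processes can use them
            torch.cuda.empty_cache()
    return outputs
```

My worry with this pattern is exactly the serialization it imposes: the GPU sits idle while the next checkpoint is read from disk. In principle the next state dict could be prefetched on a background thread into pinned CPU memory while the current model runs, but that adds complexity I was hoping a framework-level solution would handle.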
I have considered converting to ONNX. However, ONNX supports a relatively limited set of PyTorch operators and has numerous implementation differences that may lead to different outputs. I would like to ask if there is a PyTorch-native solution to this problem.
Many thanks in advance for any replies.