Assume I have a list of models, models = [model1, model2, model3, ...], and I want to get the predictions of these models on the same data tensor in parallel. What would be the best way to do it? I tried using multiprocessing.Pool, but the processes seem very slow to start.
(I’m assuming that you have just a single hardware device.)
Simply capturing the models with CUDA Graphs and running them in sequence might be tough to beat if your workload has static shapes.
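To make the CUDA Graphs suggestion concrete, here is a minimal sketch of capturing all models in one graph and replaying it. The models, shapes, and the `run_all` / `predict` helper names are made up for illustration, and the code falls back to a plain sequential loop when no GPU is available. It assumes inference only and fixed input shapes.

```python
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"

# Hypothetical stand-ins for model1, model2, model3 (assumed, not from the question).
models = [nn.Linear(16, 4).to(device).eval() for _ in range(3)]

# CUDA Graphs replay fixed memory addresses, so inputs live in a static buffer.
static_input = torch.randn(8, 16, device=device)


def run_all(x):
    # Run every model on the same tensor, one after another.
    return [m(x) for m in models]


if device == "cuda":
    # Warm up on a side stream before capture, as the PyTorch docs recommend.
    side = torch.cuda.Stream()
    side.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(side), torch.no_grad():
        for _ in range(3):
            run_all(static_input)
    torch.cuda.current_stream().wait_stream(side)

    # Capture all models into a single graph.
    graph = torch.cuda.CUDAGraph()
    with torch.no_grad(), torch.cuda.graph(graph):
        static_outputs = run_all(static_input)

    def predict(x):
        # Copy new data into the captured buffer, then replay every kernel at once.
        static_input.copy_(x)
        graph.replay()
        # Clone because the static outputs are overwritten by the next replay.
        return [o.clone() for o in static_outputs]
else:
    def predict(x):
        # CPU fallback: plain sequential execution without graph capture.
        with torch.no_grad():
            return run_all(x)


new_x = torch.randn(8, 16, device=device)
outputs = predict(new_x)  # list with one prediction tensor per model
```

The gain here comes from removing per-kernel launch overhead rather than from true parallelism, so it should help most when the individual models are small.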
It's unlikely to apply, but if you could express your models as subnetworks of a single large model (e.g., by combining their corresponding layers), that could also be a way to accelerate your workload. A trivial example: if your models only had linear layers, and each corresponding layer had the same shape across models, you could combine the smaller models into a single model that uses batch matrix multiplication via torch.bmm (see the PyTorch 2.0 documentation); the batch dimension b would be the model dimension in this case.
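As a rough sketch of the torch.bmm idea, assuming every model is a stack of nn.Linear layers with matching shapes (the sizes and the `stack_layer` / `batched_forward` helpers are hypothetical names for illustration):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical setup: three linear-only models with identical layer shapes.
num_models, batch, d_in, d_hidden, d_out = 3, 8, 16, 32, 4
models = [
    nn.Sequential(nn.Linear(d_in, d_hidden), nn.Linear(d_hidden, d_out))
    for _ in range(num_models)
]
x = torch.randn(batch, d_in)


def stack_layer(models, idx):
    # Stack the idx-th layer's parameters across models.
    # Weights become (num_models, in, out); biases become (num_models, 1, out).
    w = torch.stack([m[idx].weight.t() for m in models])
    b = torch.stack([m[idx].bias for m in models]).unsqueeze(1)
    return w, b


@torch.no_grad()
def batched_forward(models, x):
    # Share the same input across models: (num_models, batch, d_in).
    h = x.unsqueeze(0).expand(len(models), -1, -1)
    for idx in range(len(models[0])):
        w, b = stack_layer(models, idx)
        # One bmm call applies this layer for every model at once.
        h = torch.bmm(h, w) + b
    return h  # (num_models, batch, d_out)


combined = batched_forward(models, x)

# Reference: run the models one by one and compare.
with torch.no_grad():
    reference = torch.stack([m(x) for m in models])
```

In practice you would stack the weights once up front rather than on every forward call; they are restacked here only to keep the sketch short.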