How to use multiple models to perform inference on the same data in parallel?

I want to perform ensemble inference on the same validation data across multiple GPUs (i.e. 4 GPUs).
Originally there was some data parallelism in this framework, and when I used a single model for inference it worked well, with utilization above 85% on all 4 GPUs.

But when I tried to use 2 models for inference, it got much slower and both GPU and CPU utilization dropped to 25%. I think this must be caused by not parallelizing it correctly (I am using a for loop here):

##### This is for the evaluation #####
import torch

pretrained_models = ['model1', 'model2']
pool = []
for i, cur_model in enumerate(pretrained_models):
    prediction = prediction_dict[cur_model]
    pool.append(prediction.unsqueeze(0))
    if i == len(pretrained_models) - 1:
        tmp = torch.cat(pool)
        ensemble_pred = tmp.mode(dim=0).values  # mode() returns a (values, indices) namedtuple
        my_metric_save(ensemble_pred)

The basic idea is: assuming we already have the prediction vectors from both pretrained models, I use a for loop to extract them one after another and finally combine them into a new prediction vector, "ensemble_pred". I don't know how to profile the runtime, but this probably broke the original parallel flow, which slowed down the validation dramatically.
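(A minimal way to at least time the run, assuming a CUDA setup: run_validation() below is just a placeholder for my existing evaluation loop, and the synchronize calls make sure queued GPU work is included in the measurement.)

import time
import torch

torch.cuda.synchronize()   # wait for queued GPU work before starting the clock
start = time.perf_counter()

run_validation()           # placeholder for the existing evaluation loop

torch.cuda.synchronize()   # wait for the GPU to finish before stopping the clock
print(f"validation took {time.perf_counter() - start:.2f} s")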

Could someone provide some guidance on an efficient way to do ensemble inference (multiple pretrained models evaluating the same data)?

Depending on the relative computational cost of the models, it may be difficult to parallelize them across multiple GPUs "simultaneously" and synchronize the models on each batch. Since these are pretrained models, can you do the predictions in a more "offline" way, where the first model processes all the data, followed by the second model, with the predictions being aggregated after both models are done?
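A minimal sketch of that offline flow (models and val_loader are placeholders for your list of pretrained nn.Modules and the validation DataLoader; predictions are combined by majority vote, as in your snippet):

import torch

all_preds = []
for model in models:                # one full pass over the data per pretrained model
    model.eval()
    preds = []
    with torch.no_grad():
        for batch in val_loader:
            batch = batch.cuda(non_blocking=True)
            # assuming the model outputs per-class scores for each sample
            preds.append(model(batch).argmax(dim=1).cpu())
    all_preds.append(torch.cat(preds))

# aggregate only after every model has finished its own pass
ensemble_pred = torch.stack(all_preds, dim=0).mode(dim=0).values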


Thanks for the response. I will try to do that in an "offline" manner.

Yeah, the original model might already occupy multiple GPUs. In that case, if you try to run two models in parallel for inference on the same set of GPUs, there may be extra synchronization to ensure the two models' computations don't interfere with each other, which can reduce overall GPU utilization. You can try what @eqy suggested. Also, if you would like to know how each of your two models uses the GPUs on its own, you can profile the two models separately with the PyTorch profiler and see whether there are already operations occupying multiple GPUs.
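A minimal sketch of profiling one model at a time (model and sample_batch are placeholders for one of your pretrained models and a validation batch; repeat the same for the second model):

import torch
from torch.profiler import profile, ProfilerActivity

with torch.no_grad(), profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    model(sample_batch.cuda())

# per-operator CPU/GPU time for this model alone
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))

# optionally export a trace you can open in chrome://tracing to see
# which devices the kernels actually ran on
prof.export_chrome_trace("model1_trace.json")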
