How to run inference in parallel on a single GPU with a single copy of the model?

I have a relatively simple model. It is a classifier fine-tuned on top of a pretrained encoder from Hugging Face (transformers). It takes a text as input and produces a number between 0 and 1. We classify based on a threshold.

I trained it on multiple GPUs using DDP. But now I have a long list of examples (test_list) on which I need to run inference. I am aware of the method where I can use DDP again and divide the test_list across multiple GPUs (like this). But the downside of this method is that if I have n GPUs, I can partition the test_list into at most n partitions. Moreover, each GPU will have its own copy of the model running on it.

Another thing I can do is divide the test_list into N (>>n) smaller lists and run inference on each smaller list from its own bash shell, N shells in total. Like this (one command per bash shell):

CUDA_V_D=0 python --partition 1
CUDA_V_D=0 python --partition 2
CUDA_V_D=0 python --partition N

But this method will also create N copies of the model on the single GPU.

Is there a way by which I can create a single copy of model on a single GPU but run inference in parallel?

You don’t really want to do this if you don’t have to. It is better to run the model with async processing + request batching, because sharing the GPU can be rather costly. If you use the torch JIT (as long as it works, alas), you also escape the GIL that way. We do have a demo of that in Part III of our book.
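To make the "async processing + request batching" idea concrete, here is a minimal sketch using only the standard library. It is not the demo from the book. The names BatchingWorker, fake_model and classify_all are hypothetical, and fake_model is a stand-in that just maps text length to a score in [0, 1]; in a real setup you would presumably replace it with tokenization plus a forward pass of the single model on the GPU. The point is only the pattern: many concurrent callers submit single texts, one worker groups them into batches, and only that worker ever calls the model.

```python
import asyncio


def fake_model(texts):
    # Hypothetical stand-in for "tokenize + forward pass on the GPU".
    # Returns one score in [0, 1] per input text.
    return [min(len(t) / 20.0, 1.0) for t in texts]


class BatchingWorker:
    """Collects single requests and runs them through one model in batches."""

    def __init__(self, model_fn, max_batch_size=8, max_wait_s=0.01):
        self.model_fn = model_fn
        self.max_batch_size = max_batch_size
        self.max_wait_s = max_wait_s
        self.queue = asyncio.Queue()
        self.batch_sizes = []  # kept only to inspect how requests were grouped

    async def predict(self, text):
        # Called by many concurrent tasks; each waits on its own future.
        fut = asyncio.get_running_loop().create_future()
        await self.queue.put((text, fut))
        return await fut

    async def run(self):
        loop = asyncio.get_running_loop()
        while True:
            # Block until at least one request arrives.
            batch = [await self.queue.get()]
            deadline = loop.time() + self.max_wait_s
            # Fill the batch until it is full or the wait budget is spent.
            while len(batch) < self.max_batch_size:
                timeout = deadline - loop.time()
                if timeout <= 0:
                    break
                try:
                    batch.append(await asyncio.wait_for(self.queue.get(), timeout))
                except asyncio.TimeoutError:
                    break
            texts = [text for text, _ in batch]
            # Run the blocking model call in a thread so the event loop
            # can keep accepting new requests in the meantime.
            scores = await loop.run_in_executor(None, self.model_fn, texts)
            self.batch_sizes.append(len(batch))
            for (_, fut), score in zip(batch, scores):
                if not fut.cancelled():
                    fut.set_result(score)


async def classify_all(texts, threshold=0.5):
    # Create the worker inside the running loop, then fire all requests at once.
    worker = BatchingWorker(fake_model)
    task = asyncio.create_task(worker.run())
    scores = await asyncio.gather(*(worker.predict(t) for t in texts))
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        pass
    return [s >= threshold for s in scores], worker.batch_sizes


if __name__ == "__main__":
    labels, sizes = asyncio.run(classify_all(["short", "a considerably longer text"]))
    print(labels, sizes)
```

The two knobs, max_batch_size and max_wait_s, trade latency against how full each batch is; the values here are arbitrary placeholders, not recommendations.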

Best regards

