I have a relatively simple model. it is a classifier finetuned with a pretrained encoder from huggingface (transformers). It takes a text as input and produces a number between 0 to 1. We classify based on a threshold.
I trained it on multiple GPUs using DDP. But now I have a long list of examples (test_list) on which I need to run inference. I am aware of the method where I can use DDP again and divide the test_list onto multiple GPUs (like this). But downside of this method is that if I have n GPUs then I can partition the test_list into at max n partitions. Moreover, each GPU will have a copy of model running on them.
Another thing I can do is to divide the test_list into N (>>n) smaller lists and for each smaller list I can run inference using N bash shells. Like this (one command on each bash shell)
CUDA_V_D=0 python file.py --partition 1
CUDA_V_D=0 python file.py --partition 2
...
CUDA_V_D=0 python file.py --partition N
But this method will also create N copies of model on the single GPU.
Is there a way by which I can create a single copy of model on a single GPU but run inference in parallel?