Several models loaded but just one at a time in GPU

Hey guys,

I am building a website for comparing the mask outputs of several models. However, these models are very large, and only one model fits in my GPU at a time.

I am passing the image through the models sequentially.

Is there a way to move a model to the GPU, run inference, and then move the model back to the CPU?


What comes to my mind (and I have done something similar) is to prepare a dict that looks like:

models = {'model1': Model_1, 'model2': Model_2, ...}

where Model_X is a PyTorch model with its weights loaded, sitting on the CPU.

During the run I would retrieve the models sequentially, taking one at a time from the dict: move it to the GPU, run inference, move it back to the CPU, clear the GPU cache, load the next model, and so on.

The downside is definitely the time spent moving models to and from the GPU; repeated transfers will add noticeable latency.

I have found a better approach, @bonzogondo:

models = [...]  # define a list of all models, kept on the CPU

input = ...  # get your input
for model in models:'cuda')
    pred = model(input)
    # do something with pred
    del pred'cpu')
    torch.cuda.empty_cache()

The key fact is that you need to delete the prediction variable if you don't want CUDA out-of-memory issues.


This would be a good use case for TorchServe, which supports serving multiple models. Please give that a try and provide your feedback.