Parallelize loading model to GPU


I’m trying to run many models on a single GPU by switching them in and out as needed (they don’t all fit in GPU memory together), but I’m finding that loading each model:

model ='cuda')

takes 20–80 ms (e.g., VGG16: ~80 ms). If I want to load two different VGG16 models at a time, is there a way to parallelize the loading so that the total load time is < 160 ms?
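For reference, a minimal way to time the transfer is sketched below. Note that CUDA copies are asynchronous with respect to the host, so `torch.cuda.synchronize()` is needed before stopping the clock; the small `nn.Sequential` here is just a stand-in for VGG16.

```python
import time
import torch
import torch.nn as nn

# stand-in model; substitute torchvision.models.vgg16() for a real measurement
model = nn.Sequential(nn.Linear(4096, 4096), nn.ReLU(), nn.Linear(4096, 1000))
device = 'cuda' if torch.cuda.is_available() else 'cpu'

start = time.perf_counter()
if device == 'cuda':
    torch.cuda.synchronize()  # wait for the async H2D copy to actually finish
elapsed_ms = (time.perf_counter() - start) * 1000
print(f'load time: {elapsed_ms:.1f} ms')
```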


You could call to() on the different models concurrently (e.g., from multiple threads), but note that the host-to-device memory bandwidth is limited, so you cannot push the parameters to the device faster than your system allows.