Parallelize loading model to GPU


I’m trying to run many models on a single GPU by switching them in and out as needed (they don’t all fit in GPU memory together), but I’m finding that loading each model:

model ='cuda')

takes 20–80 ms (e.g., VGG16: ~80 ms). If I want to load two different VGG16 models at a time, is there a way to parallelize the loading so that the total load time is < 160 ms?
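For reference, a minimal way to time the transfer is sketched below. Note that CUDA copies are asynchronous with respect to the host, so `torch.cuda.synchronize()` is needed before stopping the clock; the small `nn.Sequential` here is just a stand-in for VGG16.

```python
import time
import torch
import torch.nn as nn

# stand-in model; substitute torchvision.models.vgg16() for a real measurement
model = nn.Sequential(nn.Linear(4096, 4096), nn.ReLU(), nn.Linear(4096, 1000))
device = 'cuda' if torch.cuda.is_available() else 'cpu'

start = time.perf_counter()
if device == 'cuda':
    torch.cuda.synchronize()  # wait for the async H2D copy to actually finish
elapsed_ms = (time.perf_counter() - start) * 1000
print(f'load time: {elapsed_ms:.1f} ms')
```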


You could call to() on the different models concurrently (e.g., from multiple threads), but note that the host-to-device memory bandwidth is limited, so you cannot push the parameters to the device faster than your system allows.