Training multiple models on one GPU simultaneously

I’m trying to train multiple models (say 10) simultaneously on a single GPU. My approach is to keep a list of models and optimizers. The models are stored on the CPU; I iterate through the model list, move one model to the GPU, train it for an epoch, move it back to the CPU, and go on to the next model in the list. The problem is that usually one of the models in the list ends up with a very high loss value (sometimes NaN). I’m wondering whether some sort of leak is happening?


This should work properly if you move models with .cuda() and .cpu().
Can you give more details about what you’ve tried? Does running a single model work?
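
For reference, here is a minimal sketch of the round-robin scheme described above. It is not the original poster's code: the tiny `nn.Linear` models, the random batches, the learning rate, and the helper `train_one_epoch` are all placeholders made up for illustration. It uses `.to(device)` instead of `.cuda()` so it also runs on a machine without a GPU, and it assumes that moving a module keeps the same `Parameter` objects, so the optimizer built beforehand still points at the right tensors (worth verifying on your PyTorch version).

```python
import torch
import torch.nn as nn

# Hypothetical setup: tiny models and random data stand in for the real ones.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
torch.manual_seed(0)

models = [nn.Linear(4, 1) for _ in range(3)]
# Plain SGD without momentum keeps no per-parameter state, so only the
# model itself has to be moved between devices in this sketch.
optimizers = [torch.optim.SGD(m.parameters(), lr=0.01) for m in models]
loss_fn = nn.MSELoss()
batches = [(torch.randn(8, 4), torch.randn(8, 1)) for _ in range(5)]


def train_one_epoch(model, optimizer):
    """Move one model to the device, train it for an epoch, move it back."""
    model.to(device)  # assumption: parameters are moved in place
    model.train()
    total = 0.0
    for x, y in batches:
        x, y = x.to(device), y.to(device)
        optimizer.zero_grad()          # clear gradients from the last batch
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()
        total += loss.item()
    model.cpu()                        # free the GPU for the next model
    model.eval()
    return total / len(batches)


# One pass over the list: each model gets one epoch on the device.
epoch_losses = [train_one_epoch(m, o) for m, o in zip(models, optimizers)]
print(epoch_losses)
```

With a stateful optimizer (momentum, Adam), its internal buffers are separate tensors from the model parameters, so where they end up after the model is moved is worth checking as well.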


Thanks for the quick response. I start with the list of 10 networks, and move them to CPU with


and I put them in eval mode. Then, when I want to train one of them for an epoch, I say

model[i].zero_grad() – I know this is redundant, just thought I’d try it



Running a small number of models (2) seems to work fine.

You do call optimizer[i].zero_grad() (or the model version) at every iteration during your epoch, right? If not, see this discussion: Why do we need to set the gradients manually to zero in pytorch?

Otherwise, it looks good!
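
To illustrate why the zero_grad() call matters, here is a small sketch showing that gradients accumulate across backward() calls unless they are reset. The model, the data, and the helper `grad_after_backward` are made up for the example.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(3, 1)                       # hypothetical toy model
x, y = torch.randn(4, 3), torch.randn(4, 1)   # hypothetical fixed batch


def grad_after_backward():
    """Run one forward/backward pass and return a copy of the weight grad."""
    loss = nn.functional.mse_loss(model(x), y)
    loss.backward()
    return model.weight.grad.clone()


model.zero_grad()
g1 = grad_after_backward()   # gradient from a single backward pass
g2 = grad_after_backward()   # no reset in between: the gradients add up
model.zero_grad()
g3 = grad_after_backward()   # reset first: same as the single pass

print(torch.allclose(g2, 2 * g1))
print(torch.allclose(g3, g1))
```

Because the parameters are not updated between the passes, the second call returns exactly twice the first gradient, which is what a missing reset looks like in a training loop.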

Yes. I call it after every batch is fed through the network and the loss is calculated/backward is called.

Then given what you shared, I don’t see any reason for a model to behave differently than the others.
Is it always the same model in the list that lags behind? Do you see any pattern?
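
One way to look for such a pattern is to record the per-model loss after every epoch and flag the entries that are NaN/inf or far above the rest. The sketch below is only an illustration: the helper `flag_outlier_models`, the `ratio` threshold, and the loss values in `history` are invented, not taken from this thread.

```python
import math
from statistics import median


def flag_outlier_models(epoch_losses, ratio=10.0):
    """Return indices of models whose loss is non-finite or far above the median."""
    finite = [l for l in epoch_losses if math.isfinite(l)]
    if not finite:
        # Every loss is NaN/inf: flag all models.
        return list(range(len(epoch_losses)))
    typical = median(finite)
    return [
        i
        for i, l in enumerate(epoch_losses)
        if not math.isfinite(l) or l > ratio * typical
    ]


# Hypothetical loss history: one row per epoch, one column per model.
history = [
    [0.90, 0.88, 0.91, 0.87],
    [0.52, 0.50, 7.40, 0.49],
    [0.31, 0.30, float("nan"), 0.29],
]
flagged_per_epoch = [flag_outlier_models(losses) for losses in history]
print(flagged_per_epoch)  # → [[], [2], [2]]
```

If the same index keeps being flagged, the problem likely follows that model or its optimizer; if the flagged index moves around, the position in the training order is a more likely suspect.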