Training multiple models on one GPU simultaneously

So I’m trying to train multiple (let’s say 10) models simultaneously on a single GPU. The way I’m going about this is to keep a list of models and a list of optimizers. The models are stored on the CPU, and I iterate through the model list, putting one model on the GPU, training it for an epoch, and then putting it back on the CPU before moving to the next model in the list. The problem is that usually one of the models in the list ends up with a very high loss value (sometimes nan). I’m wondering if there is some sort of leak happening?
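For concreteness, a minimal sketch of that pattern (the tiny linear models, the dummy batches, and the run_one_epoch helper are placeholders, not my actual setup):

import torch
import torch.nn as nn

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

models = [nn.Linear(32, 1) for _ in range(10)]  # placeholder models, kept on CPU
optimizers = [torch.optim.SGD(m.parameters(), lr=1e-2) for m in models]

def run_one_epoch(model, optimizer):
    # Placeholder for one pass over the data: a few dummy batches.
    for _ in range(5):
        x = torch.randn(8, 32, device=device)
        y = torch.randn(8, 1, device=device)
        optimizer.zero_grad()  # reset gradients every batch
        loss = nn.functional.mse_loss(model(x), y)
        loss.backward()
        optimizer.step()

for epoch in range(3):
    for i in range(len(models)):
        models[i].to(device)  # move this model onto the training device (GPU if available)
        models[i].train()
        run_one_epoch(models[i], optimizers[i])
        models[i].eval()
        models[i].to('cpu')  # park it back on the CPU before the next model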

Hi,

This should work properly if you move models with .cuda() and .cpu().
Can you give more details about what you’ve tried? Does running a single model work?

Hi,

Thanks for the quick response. I start with a list of 10 networks and move them to the CPU with

models[i].to('cpu')

and I put them in eval mode. Then, when I want to train one of them for an epoch, I say

models[i].to('cuda')
models[i].train()
optimizer[i].zero_grad()
models[i].zero_grad()  # I know this is redundant, just thought I'd try it

# (forward pass and loss computation for each batch go here)
loss.backward()
optimizer[i].step()

models[i].eval()
models[i].to('cpu')

Running a small number of models (2) seems to work fine.

You do call optimizer[i].zero_grad() (or the model version) at every iteration during your epoch, right? If not, you can check this discussion: Why do we need to set the gradients manually to zero in pytorch?
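For illustration, a tiny standalone example (a dummy tensor, not your code) of the accumulation behaviour that discussion describes:

import torch

w = torch.ones(3, requires_grad=True)
for step in range(3):
    loss = (w * 2.0).sum()
    loss.backward()  # no zero_grad(): gradients add up across iterations
    print(step, w.grad)  # [2, 2, 2], then [4, 4, 4], then [6, 6, 6]

w.grad.zero_()  # the per-tensor equivalent of optimizer.zero_grad()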

Otherwise, it looks good!

Yes. I call it after every batch is fed through the network, the loss is calculated, and backward is called.

Then, given what you shared, I don’t see any reason for one model to behave differently from the others.
Is it always the same model in the list that lags behind? Do you see any pattern?
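One simple way to look for a pattern (a sketch with illustrative names and an arbitrary threshold, not code from this thread) is to record each model's mean epoch loss by index and check whether the same index is always the outlier:

import math

num_models = 10
epoch_losses = {i: [] for i in range(num_models)}

# Inside the training loop, after computing each batch loss for model i:
#     epoch_losses[i].append(loss.item())

threshold = 10.0  # arbitrary cut-off for "suspiciously high"
for i, losses in epoch_losses.items():
    if not losses:
        continue
    mean = sum(losses) / len(losses)
    if math.isnan(mean) or mean > threshold:
        print(f"model {i} looks unstable: mean epoch loss = {mean:.3f}")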