Train multiple models on multiple GPUs

Boyu_Zhang · April 24, 2018, 5:03am

Is it possible to train multiple models on multiple GPUs where each model is trained on a distinct GPU simultaneously?

For example, suppose there are 2 GPUs:

model1 = model1.cuda(0)
model2 = model2.cuda(1)

Then train these two models simultaneously using the same dataloader.

ptrblck · April 24, 2018, 9:46am

It should work! You have to make sure the Variables/Tensors are located on the right GPU.
Could you explain a bit more about your use case?
Are you merging the outputs somehow or are the models completely independent from each other?

Boyu_Zhang · May 2, 2018, 2:02am

Hi ptrblck, thanks for your reply. The models are completely independent of each other, but in some training steps they transfer information to one another, so I need to train them simultaneously. By the way, if I want to train all the models simultaneously, how should I write the code? Currently my code looks like the following, but I suspect the models are trained sequentially:

model1 = model1.cuda(0)
model2 = model2.cuda(1)
models = [model1, model2]

for (input, label) in data_loader:
    for m in models:
        m.train()
        optimizer.zero_grad()
        output = m(input)
        loss = criterion(output, label)
        loss.backward()
        optimizer.step()

ptrblck · May 2, 2018, 1:53pm

I think in your current implementation you would indeed have to wait until the optimization is done on each GPU.
If you just have two models, you could push each input and target tensor to the appropriate GPU and call the forward passes one after the other.
Since CUDA calls are performed asynchronously, you could achieve a speedup this way.
The code should look like this:

input1 = input.to('cuda:0')
input2 = input.to('cuda:1')
# same for label
optimizer1.zero_grad()
optimizer2.zero_grad()

output1 = model1(input1)  # should be an asynchronous call
output2 = model2(input2)
...

Unfortunately I cannot test it at the moment. Would you run it and check if it’s suitable for your use case?
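
A fuller sketch of the idea above, for readers who want something they can run. It is only an illustration under stated assumptions: the two `nn.Linear` models, the toy batches standing in for the dataloader, and the `train_step` helper are all hypothetical, and the code falls back to the CPU when fewer than two GPUs are present. The main points are that each model gets its own optimizer and that no result is read back until work has been queued on every device.

```python
import math

import torch
import torch.nn as nn

# Fall back to the CPU so the sketch also runs without two GPUs.
if torch.cuda.device_count() >= 2:
    devices = [torch.device("cuda:0"), torch.device("cuda:1")]
else:
    devices = [torch.device("cpu"), torch.device("cpu")]

torch.manual_seed(0)

# Hypothetical stand-ins for model1/model2; each model has its own optimizer.
models = [nn.Linear(8, 2).to(d) for d in devices]
optimizers = [torch.optim.SGD(m.parameters(), lr=0.1) for m in models]
criterion = nn.CrossEntropyLoss()

# Toy batches standing in for the shared dataloader.
data_loader = [(torch.randn(16, 8), torch.randint(0, 2, (16,))) for _ in range(3)]


def train_step(inputs, labels):
    """Run one optimization step for every model on the same batch."""
    losses = []
    # Queue the work for every device before reading any result back,
    # so kernels on different GPUs get the chance to overlap.
    for model, optimizer, device in zip(models, optimizers, devices):
        model.train()
        x = inputs.to(device)
        y = labels.to(device)
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()
        losses.append(loss.detach())
    # .item() waits for the device, so it is called only after the loop.
    return [loss.item() for loss in losses]


for inputs, labels in data_loader:
    step_losses = train_step(inputs, labels)
```

Calling `.item()` (or printing a loss) inside the inner loop would force the host to wait for that GPU before queuing work on the next one, which is why the sketch defers it.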

Maxence_Ernoult · March 14, 2019, 4:24pm

Hi!
I am still interested in this topic. I am very new to PyTorch and would like to train different models in parallel on different GPUs (i.e. one model per GPU), for hyperparameter search or simply to get results for different weight initializations. I know there is a lot of documentation on multiprocessing, as well as existing frameworks for hyperparameter tuning, which I have already checked. However, I only have a limited amount of time, so I am on the lookout for the very simplest way to achieve this. Any help would be greatly appreciated, thank you for your attention.

Boyu_Zhang · March 17, 2019, 9:46am

You can look at Horovod, which is developed by Uber. It makes parallel training extremely easy.
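
For fully independent runs (one model per GPU, no communication between them), a lighter option that needs no extra framework is to start one process per run and pin each to a GPU through the `CUDA_VISIBLE_DEVICES` environment variable. The sketch below assumes two GPUs with indices 0 and 1 and a hypothetical list of learning rates; the child command is a stand-in for a real training script and only reports which GPU it was given, so the example runs on any machine.

```python
import os
import subprocess
import sys

# Hypothetical search space: one learning rate per run.
learning_rates = [0.1, 0.01]
gpu_ids = [0, 1]  # assumed GPU indices; adjust to the machine

# Stand-in for a real training script (e.g. a train.py taking the
# learning rate as an argument). It only echoes what it received.
child_code = (
    "import os, sys; "
    "print(sys.argv[1], os.environ['CUDA_VISIBLE_DEVICES'])"
)

procs = []
for lr, gpu in zip(learning_rates, gpu_ids):
    # Each child sees only the GPU listed in CUDA_VISIBLE_DEVICES.
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(gpu))
    # Popen returns immediately, so all runs execute concurrently.
    procs.append(
        subprocess.Popen(
            [sys.executable, "-c", child_code, str(lr)],
            env=env,
            stdout=subprocess.PIPE,
            text=True,
        )
    )

# Wait for every run to finish and collect what it printed.
results = [p.communicate()[0].split() for p in procs]
```

Inside each child process the single visible GPU appears as `cuda:0`, so the training script itself does not need to know which physical GPU it was assigned.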

SylarCheng · May 27, 2020, 10:30am

Hi Boyu, how did you finally implement this? I am confused by this problem and would be grateful for your reply.

Boyu_Zhang · June 1, 2020, 2:28am

Hi, in the end I used the “mpi4py” library to implement this. With MPI, you can assign each rank to train one model on one GPU. MPI also supports communication across ranks, which you can use to implement special operations.

iffiX · June 1, 2020, 3:50am

MPI is not necessary here: the torch.distributed package now provides MPI-style and RPC-style distributed APIs. It also supports the gloo, mpi, and nccl backends (MPI-style only), so if you want to avoid extra hassle, these should be sufficient.

SylarCheng · June 4, 2020, 3:20am

Thanks Boyu and iffiX. Wow, two ways to get the result, I like these!

Michelle_Owen · August 27, 2020, 2:21pm

@Boyu_Zhang can you provide a code example showing how you achieved this?

heroadz · October 30, 2020, 5:29am

I am still confused by this problem. Which method would work for it? asyncio? Could you provide some code or a useful post?

Aray · November 18, 2020, 6:28pm

Here is my implementation of CycleGAN, where I parallelize training by making use of 4 GPUs.

CycleGAN consists of 4 models: Generator+Discriminator for type A images, and Generator+Discriminator for type B images.

I train the first pair of G and D on one device and the second pair on the other.

netG_B2A = Generator().to(device1)
netD_A = Discriminator().to(device1)
netG_A2B = Generator().to(device2)
netD_B = Discriminator().to(device2)

cycle_loss1 = torch.nn.L1Loss().to(device1)
cycle_loss2 = torch.nn.L1Loss().to(device2)
identity_loss1 = torch.nn.L1Loss().to(device1)
identity_loss2 = torch.nn.L1Loss().to(device2)
adversarial_loss1 = torch.nn.MSELoss().to(device1)
adversarial_loss2 = torch.nn.MSELoss().to(device2)

I also keep 2 copies of the input data and 2 copies of each loss function, one per device.

real_image_A1 = data[0].to(device1)
real_image_B1 = data[1].to(device1)
real_image_A2 = data[0].to(device2)
real_image_B2 = data[1].to(device2)

Moreover, I have to copy the images produced by one G to the other device, since they are fed into the second pair of G and D.

# ...
fake_image_B = netG_A2B(real_image_A2)
# ...
recovered_image_A = netG_B2A(fake_image_B.to(device1))

The final loss is calculated on the CPU.

errG = loss_identity_A.cpu() + loss_identity_B.cpu() + loss_GAN_A2B.cpu() + loss_GAN_B2A.cpu() + loss_cycle_ABA.cpu() + loss_cycle_BAB.cpu()
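
The cross-device pattern described above can be reduced to a small sketch. It is not the CycleGAN code itself: the two `nn.Linear` layers are hypothetical stand-ins for the generators, only two of the losses are shown, and both devices fall back to the CPU when two GPUs are not available. What it illustrates is that `.to(device)` and `.cpu()` are differentiable copies, so a single `backward()` on the summed loss reaches the parameters on both devices.

```python
import torch
import torch.nn as nn

# Fall back to the CPU when two GPUs are not available.
two_gpus = torch.cuda.device_count() >= 2
device1 = torch.device("cuda:0" if two_gpus else "cpu")
device2 = torch.device("cuda:1" if two_gpus else "cpu")

torch.manual_seed(0)

# Hypothetical tiny stand-ins for netG_A2B and netG_B2A.
net_a2b = nn.Linear(4, 4).to(device2)
net_b2a = nn.Linear(4, 4).to(device1)
l1 = nn.L1Loss()

real_a = torch.randn(2, 4)

# Forward on device2, then copy the result to device1 for the second model.
fake_b = net_a2b(real_a.to(device2))
recovered_a = net_b2a(fake_b.to(device1))

# Each loss is computed on the device that holds its inputs.
loss_cycle = l1(recovered_a, real_a.to(device1))
loss_identity = l1(net_a2b(real_a.to(device2)), real_a.to(device2))

# Summing on the CPU keeps the autograd graph intact across devices.
total = loss_cycle.cpu() + loss_identity.cpu()
total.backward()
```

After `total.backward()`, gradients are populated on the parameters of both models, each on its own device, so per-device optimizers can step as usual.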

In addition, I have 2 classification models, each allocated on its own device.