How to simultaneously train two different models on two different GPUs, using the same data?

Hi there!

This might sound like a simple question, but I can’t find anything that actually works when I implement it.

So, I have two models (net1 and net2), and I have two GPUs (“cuda:0” and “cuda:1”). I have the training and test dataloaders (train_dl, test_dl). I just want to train both models simultaneously, like:

net1.to(device1)  # device1 = "cuda:0"
net2.to(device2)  # device2 = "cuda:1"

accuracy1 = training(net1)
accuracy2 = training(net2)

How do I do this? How do I make sure these last two calls don’t execute sequentially? I want them to run at the same time. I have read about something related to “spawn”, but I really have no idea how to implement it.

Thank you!

Hi,

If net1 and net2 do not depend on each other, you can just spawn 2 processes and train each model in its own process. You can do something like:

import torch.multiprocessing as mp

def your_func(rank):
    # rank is 0 for the first spawned process and 1 for the second
    if rank == 0:
        training(net1)  # net1 already moved to "cuda:0"
    else:
        training(net2)  # net2 already moved to "cuda:1"

if __name__ == "__main__":
    mp.spawn(
        your_func,
        args=(),
        nprocs=2,
        join=True)

Would that work for you?


Thank you very much! This is exactly what I needed. I have only one more question:

If I wanted to return something from my function (such as the accuracy value), what would be the best way to do so? I was thinking about having a global list and updating that list inside the function, but that is probably a bit odd.

Thanks again!!

A global list may be tricky, since each process would have its own copy of it. Overall, you would need to communicate the accuracies across processes to aggregate them; you can look into Python multiprocessing primitives such as mp.Manager(): https://docs.python.org/3/library/multiprocessing.html.
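
For example, here is a minimal sketch of that idea, assuming net1, net2, and your training() function are defined at module level as above, and that training() returns the accuracy (torch.multiprocessing re-exports the standard multiprocessing API, so mp.Manager() is available through it). A manager dict is shared across the spawned processes, each worker writes its accuracy under its rank, and the parent reads the values back after the join:

import torch.multiprocessing as mp

def your_func(rank, results):
    # Each spawned process stores its accuracy in the shared dict under its rank.
    if rank == 0:
        results[0] = training(net1)  # net1 already moved to "cuda:0"
    else:
        results[1] = training(net2)  # net2 already moved to "cuda:1"

if __name__ == "__main__":
    with mp.Manager() as manager:
        results = manager.dict()  # proxy object visible to all processes
        mp.spawn(your_func, args=(results,), nprocs=2, join=True)
        accuracy1, accuracy2 = results[0], results[1]
        print(accuracy1, accuracy2)

An mp.Queue that each worker puts its (rank, accuracy) pair into would work just as well.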