Training Multiple Models Simultaneously

Hello,

I am trying to train n models. Each model has the same structure and the same inputs, but learns a different output. Since the models are distinct from one another, I would like to parallelize them. My attempt so far is to put the models on different GPUs. This is done through:

import torch

def try_gpu(i=0):  #@save
    """Return gpu(i) if exists, otherwise return cpu()."""
    if torch.cuda.device_count() >= i + 1:
        return torch.device(f'cuda:{i}')
    return torch.device('cpu')

model = [LinearNet(net_num).double().to(try_gpu(i % num_gpu)) for i in range(net_num)]

So for example, if I have 6 models and num_gpu = 2, then model[0] is on cuda:0, model[1] is on cuda:1, etc.

I have also made sure to put the inputs onto the different GPUs. However, I’m stuck at actually making this parallel: I’m still just looping over the models and running them sequentially. How would I actually run model[0] and model[1] in parallel?

Thanks for the help!

Hi @semperDM

In your case I’d suggest giving Hydra’s parallel multirun a try (Joblib Launcher plugin | Hydra), or simply parameterizing your script with command-line arguments and launching multiple Python script instances from a Bash script.
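For the command-line route, a minimal sketch could look like the following (the script name, flags, and the commented-out LinearNet call are placeholders for illustration, not code from this thread):

import argparse
import torch

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--model-idx", type=int, required=True)
    parser.add_argument("--gpu", type=int, default=0)
    args = parser.parse_args()

    device = torch.device(f"cuda:{args.gpu}" if torch.cuda.is_available() else "cpu")
    # build the model/data for this particular output and train it on `device`,
    # e.g. model = LinearNet(...).double().to(device)

if __name__ == "__main__":
    main()

# Launched from Bash, one background process per model, e.g.:
#   for i in 0 1 2 3 4 5; do
#       python train.py --model-idx $i --gpu $((i % 2)) &
#   done
#   wait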

Hello,

I’m trying to avoid multiple script launches, because I want to avoid writing data out and then reading it back in. Currently, I have another script that creates the data, which is then immediately fed into the neural network script. If I do multiple launches with Bash, then I either have to keep rerunning the data-creation script or keep reading in the data, neither of which seems optimal. As things scale up in terms of data and the number of networks being trained, it would become slower and slower to read it in or to recalculate it, hence why I’m trying to avoid that.

I don’t understand why you would need to write the data and read it back in, if the experiments have the same inputs.

Decouple the data creation script from the model, and it won’t affect the training.

Hello,

If I decouple the data creation script from the model, then I must write all of that data somewhere. Then, every time I launch a new model, I will have to read in all of this data. Once our data reaches the millions, this will become rather inefficient, so I’d rather keep them coupled so that the data is just passed to the model script, and then the model script itself is parallelized to run the n models on different GPUs.

In that case, you should probably look at in-memory datasets and the Joblib launcher to run the model training in parallel. Once you’ve created an in-memory dataset, you can instantiate N dataloaders that are passed to the N different models at the start of the job.
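A rough sketch of that idea (all of the names here, such as X, ys, models and train_one, are placeholders rather than code from this thread; it assumes at least one GPU is visible and that the models and tensors start on the CPU so each worker can move its own copy over):

import torch
import torch.nn as nn
from joblib import Parallel, delayed
from torch.utils.data import DataLoader, TensorDataset

def train_one(model, loader, device):
    model = model.to(device)
    # ... usual training loop over `loader`, moving each batch to `device` ...
    return model.cpu()

if __name__ == "__main__":
    # Dummy in-memory data: X is shared across models, ys[i] is the i-th target.
    N = 4
    X = torch.randn(10_000, 32).double()
    ys = [torch.randn(10_000, 1).double() for _ in range(N)]
    models = [nn.Linear(32, 1).double() for _ in range(N)]

    loaders = [
        DataLoader(TensorDataset(X, ys[i]), batch_size=64, shuffle=True)
        for i in range(N)
    ]

    # One job per model; joblib's default loky backend spawns worker processes,
    # so each worker initializes CUDA on its own and moves its model over.
    trained = Parallel(n_jobs=N)(
        delayed(train_one)(models[i], loaders[i], f"cuda:{i % torch.cuda.device_count()}")
        for i in range(N)
    )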

Your workflow can launch the model executions on different devices and they will be executed asynchronously, unless you are synchronizing explicitly (via torch.cuda.synchronize()) or implicitly, e.g. via tensor.nonzero(), tensor.item(), etc.
If you are not seeing overlapping execution in e.g. Nsight Systems, your CPU (and the overall dispatching) might be too slow compared to the GPU execution time.
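To illustrate the point, here is a self-contained sketch (not code from this thread) assuming two GPUs are available: the Python loop returns as soon as the kernels are queued, so both forward passes can overlap on the two devices, while calling e.g. .item() or .cpu() inside the loop would force a sync and serialize the work.

import time
import torch
import torch.nn as nn

# Two independent models on two devices.
models = [nn.Linear(4096, 4096).to(f"cuda:{i}") for i in range(2)]
inputs = [torch.randn(1024, 4096, device=f"cuda:{i}") for i in range(2)]

torch.cuda.synchronize(0)
torch.cuda.synchronize(1)
t0 = time.perf_counter()

# Sequential launches from the CPU; the kernels themselves run asynchronously.
outs = [m(x) for m, x in zip(models, inputs)]

torch.cuda.synchronize(0)
torch.cuda.synchronize(1)
print(f"both forwards finished in {time.perf_counter() - t0:.4f} s")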

Hello @ptrblck,

So what I’ve done so far is send model1 to gpu 1 and model2 to gpu 2, and I define

models = [model1, model2]

In my training model then, I have

loss_fn = nn.MSELoss()
optimizer = [optim.Adam(m.parameters(), lr=lr) for m in models]

After sending the input tensors (X,y) to the appropriate GPUs as well, I then have

for m in models:
    m.train()

pred = []
for i in range(len(models)):
    pred.append(models[i](X[i]))

And it is at this point that, despite the models being on different GPUs, I’m worried I’m still doing things in a serial fashion.

Thanks for the help!

The CPU will launch the model executions sequentially, but as long as no model synchronizes internally, both will be executed in parallel on the two GPUs.
You could verify it with a profiler, as mentioned before. If you do this and don’t see an overlap, your training might be CPU-bound, i.e. the CPU might not be able to schedule the work fast enough compared to the runtime of the models on the GPUs.
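As a concrete sketch (reusing the names from the snippets above and assuming y[i] lives on the same device as models[i] and X[i]), one training step could look like this; calling .item() inside the loop would synchronize and serialize the work, so the scalars are only read at the end:

for m in models:
    m.train()

losses = []
for i in range(len(models)):
    optimizer[i].zero_grad()
    pred = models[i](X[i])        # queued on models[i]'s device, returns immediately
    loss = loss_fn(pred, y[i])
    loss.backward()
    optimizer[i].step()
    losses.append(loss.detach())  # keep on the GPU to avoid a sync inside the loop

# read the scalars only once per logging step
print([l.item() for l in losses])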

Hello everyone, I’m trying to do something similar but on the same GPU. In my case, I believe I need to use Process from torch.multiprocessing. However, I don’t see a real performance boost compared to the sequential case. I’m doing something like:

import torch.multiprocessing as mp
from torch.multiprocessing import Process

if __name__ == '__main__':
    mp.set_start_method('spawn')
    ...
    procs = []
    for i in range(N):
        procs.append(Process(target=train, args=(models[i],data[i],)))

    for p in procs:
        p.start()
    
    for p in procs:
        p.join()

Is there something wrong with what I’m doing? I’m not sure how the GPU memory is then shared/used by the processes. I’d like the models to be trained as completely separate entities.
Thanks for your answers!
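(For context, the train target above isn’t shown in the post; a worker in this kind of setup might look roughly like the sketch below. The function signature, hyperparameters, and data handling are assumptions, and each spawned process keeps its own model, optimizer, and CUDA context on the shared GPU.)

import torch
import torch.nn as nn
import torch.optim as optim

def train(model, data, device="cuda:0", epochs=10, lr=1e-3):
    # Each spawned process works on its own copy of the model and optimizer.
    model = model.to(device)
    loss_fn = nn.MSELoss()
    optimizer = optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for X, y in data:                       # data assumed to yield (input, target) batches
            X, y = X.to(device), y.to(device)
            optimizer.zero_grad()
            loss = loss_fn(model(X), y)
            loss.backward()
            optimizer.step()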