Training in parallel

I’m training a VAE similar to the implementation in PyTorch’s GitHub examples. The main function looks like this:

if __name__ == "__main__":
    for epoch in range(1, args.epochs + 1):
        train(epoch)

Assuming the training function takes an additional input parameter, pi, I would like to write code that trains multiple models with different values of pi.

if __name__ == "__main__":
    for i in range(10):
        pi = get_param(seed=i)
        for epoch in range(1, args.epochs + 1):
            train(epoch, pi)

My question is how I can run this in parallel on the GPU, so that each core trains a single model.

If your GPU is already fully utilized, you won’t be able to train the models in parallel; the kernels will simply be queued and executed one after another.
On the other hand, if you have multiple devices, you can run one training routine on each device and they will execute in parallel.
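
Here is a minimal sketch of the multi-device approach using torch.multiprocessing. It assumes your train() and get_param() functions (and args) are defined in the same module, and that train() accepts a device argument so each model actually lives on its own GPU; that last part is an assumption on my side, since the original example hard-codes the device.

import torch
import torch.multiprocessing as mp

def run_on_device(rank, world_size):
    # Each spawned process gets a rank in [0, world_size) and uses it
    # both as the GPU index and as the seed for its own parameter pi.
    device = torch.device(f"cuda:{rank}")
    pi = get_param(seed=rank)  # get_param comes from your script
    for epoch in range(1, args.epochs + 1):
        # Assumption: train() takes a device argument; adapt to your signature.
        train(epoch, pi, device=device)

if __name__ == "__main__":
    world_size = torch.cuda.device_count()
    # Spawn one process per GPU; each process trains an independent model.
    mp.spawn(run_on_device, args=(world_size,), nprocs=world_size, join=True)

If you have more parameter settings than GPUs, you would loop over them in chunks of world_size (or map several ranks to the same device), but processes sharing a GPU will compete for its resources, which brings you back to the queuing behaviour described above.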