Run multiple independent models on a single GPU

I want to train a number of small models in parallel on a single GPU. The models are small enough that I can easily fit 20 or more of them on the GPU at once. Currently I can only run them sequentially, which leaves the GPU underutilized.

My code looks like this:

import torch
import torch.nn as nn
import torch.optim as optim

def main():
    num_models = 20

    device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
    models = [Model().to(device) for _ in range(num_models)]

    for model in models:  # trained strictly one after another
        optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
        criterion = nn.CrossEntropyLoss()
        trainloader, testloader = dataloader()  # user-defined data loading helper
        run_training(model, optimizer, criterion, trainloader, testloader, device)

The run_training() function is implemented as follows:

def run_training(model, optimizer, criterion, trainloader, testloader, device):

    num_epochs = 2
    for epoch in range(num_epochs):  # loop over the dataset multiple times

        epoch_loss = 0.0
        epoch_counter = 0
        for data in trainloader:

            inputs, labels = data[0].to(device), data[1].to(device)
            optimizer.zero_grad()
            outputs = model(inputs)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()
            # loss is a per-batch mean, so weight it by the batch size
            # before averaging over the epoch
            epoch_loss += loss.item() * inputs.size(0)
            epoch_counter += inputs.size(0)

        print('epoch %d loss %.3f' % (epoch + 1, epoch_loss / epoch_counter))

What can I do to parallelize this for-loop, i.e. to train all of the models on the GPU at the same time?

Why don’t you just run independent Python kernels?
CUDA calls are asynchronous, but everything else runs on the main thread.
The only in-process option is to use threading with daemon threads.
I did that to run parallel instances of OpenPose, although its backend is C++.
I do know that multiprocessing doesn’t work well with CUDA, but I’m not sure about threading.
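
If you want to try the threading route in a single script, a rough sketch could look like this. Model, dataloader and run_training are the helpers from your post; how much actually overlaps on the GPU depends on how much time PyTorch spends outside the GIL:

import threading
import torch
import torch.nn as nn
import torch.optim as optim

def main():
    num_models = 20
    device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
    models = [Model().to(device) for _ in range(num_models)]

    threads = []
    for model in models:
        optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
        criterion = nn.CrossEntropyLoss()
        trainloader, testloader = dataloader()
        # daemon threads exit together with the main process; PyTorch
        # releases the GIL inside most CUDA ops, so the per-model
        # training loops can overlap on the GPU to some extent
        t = threading.Thread(
            target=run_training,
            args=(model, optimizer, criterion, trainloader, testloader, device),
            daemon=True)
        t.start()
        threads.append(t)

    for t in threads:
        t.join()  # wait for every training loop to finish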

Anyway, the simplest option is to launch several kernels from a short bash script.
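
Equivalently, the launcher itself can be a few lines of Python; train.py and its --seed flag here are made-up stand-ins for a script that trains one model:

import subprocess

# start 20 independent training processes; note that each one creates
# its own CUDA context, which costs some extra GPU memory per process
procs = [subprocess.Popen(["python", "train.py", "--seed", str(i)])
         for i in range(20)]
for p in procs:
    p.wait()  # block until all of them have finished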

Hello! Thanks for your answer!

I can’t use independent Python kernels because I want the models to interact after each training session, and there are a lot of training sessions involved. So I need to train them in a single script.

Why does multiprocessing not work with CUDA? What library can I use to try threading?

Well, it is supposed to work, but the few times I tried multiprocessing with CUDA I ran into plenty of problems.
Here you can find the info you need:
https://pytorch.org/docs/stable/notes/multiprocessing.html

For threading, there’s the threading module in the Python standard library.
Both share the same underlying idea, but they are different, and threading is more difficult to use correctly.
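
If you do want to try it, a minimal sketch of the multiprocessing route could look like this. Model, dataloader and run_training are the functions from the first post, and they have to be importable in the spawned child processes:

import torch
import torch.multiprocessing as mp
import torch.nn as nn
import torch.optim as optim

def train_one(rank):
    # each worker process builds and trains its own model on the shared
    # GPU; rank is available e.g. for per-model seeding
    device = torch.device("cuda:0")
    model = Model().to(device)
    optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
    criterion = nn.CrossEntropyLoss()
    trainloader, testloader = dataloader()
    run_training(model, optimizer, criterion, trainloader, testloader, device)

if __name__ == "__main__":
    # the CUDA runtime does not survive fork(); use 'spawn' as the
    # multiprocessing notes linked above advise
    mp.set_start_method("spawn")
    processes = [mp.Process(target=train_one, args=(i,)) for i in range(20)]
    for p in processes:
        p.start()
    for p in processes:
        p.join()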


Running independent processes is a bad idea (IMHO) if you have a large pre-processed dataset: you will quickly run out of memory. Alternatively, if you load samples from disk, you are slowed down by disk I/O (and by unnecessarily redoing the preprocessing of the raw samples). Isn’t there a way to train multiple independent copies of the same (differently initialized) model in parallel, but in such a way that each copy of the model reads from exactly the same location in GPU (or, if that’s not possible, CPU) memory storing the dataset/dataloader object? I asked this question in a similar thread here.
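
To make the idea concrete, here is a rough sketch under the assumption that the whole pre-processed dataset fits in GPU memory; Model is the class from the first post, and the random tensors are just placeholders for real data:

import torch
import torch.nn as nn
import torch.optim as optim

device = torch.device("cuda:0")

# one GPU-resident copy of the (pre-processed) dataset, shared by all models
all_inputs = torch.randn(10000, 3, 32, 32, device=device)
all_labels = torch.randint(0, 10, (10000,), device=device)

models = [Model().to(device) for _ in range(20)]
optimizers = [optim.SGD(m.parameters(), lr=0.01, momentum=0.9)
              for m in models]
criterion = nn.CrossEntropyLoss()

batch_size = 64
for start in range(0, all_inputs.size(0), batch_size):
    # contiguous slices of a CUDA tensor are views: every model reads
    # the same memory, nothing is copied or reloaded from disk
    inputs = all_inputs[start:start + batch_size]
    labels = all_labels[start:start + batch_size]
    for model, optimizer in zip(models, optimizers):
        optimizer.zero_grad()
        loss = criterion(model(inputs), labels)
        loss.backward()
        optimizer.step()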