I want to train a number of small models in parallel on a single GPU. The models are small enough that 20 or more fit on the GPU at once. Currently I can only train them sequentially, which leaves the GPU underutilized.
My code looks like this:
    import torch
    import torch.nn as nn
    import torch.optim as optim

    # Model and dataloader() are defined elsewhere.
    def main():
        num_models = 20
        device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
        models = [Model().to(device) for _ in range(num_models)]
        # Train the models one after another.
        for model in models:
            optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
            criterion = nn.CrossEntropyLoss()
            trainloader, testloader = dataloader()
            run_training(model, optimizer, criterion, trainloader, testloader, device)
The run_training() function is implemented as follows:
    def run_training(model, optimizer, criterion, trainloader, testloader, device):
        num_epochs = 2
        for epoch in range(num_epochs):  # loop over the dataset multiple times
            epoch_loss = 0.0
            epoch_counter = 0
            for i, data in enumerate(trainloader, 0):
                inputs, labels = data[0].to(device), data[1].to(device)
                optimizer.zero_grad()
                outputs = model(inputs)
                loss = criterion(outputs, labels)
                loss.backward()
                optimizer.step()
                # nn.CrossEntropyLoss averages over the batch by default,
                # so weight the batch mean by the batch size before summing.
                epoch_loss += loss.item() * inputs.size(0)
                epoch_counter += inputs.size(0)
            print('epoch %d loss %.3f' % (epoch + 1, epoch_loss / epoch_counter))
How can I parallelize the for-loop so that all models are trained on the GPU at the same time?