Slower using 2 GPUs than just one

Hi guys, I’ve just tried to run my network with two GPUs, and here is my code:

if use_gpu:
    model = nn.DataParallel(model)
    model = model.cuda()

criterion = nn.CrossEntropyLoss()
optimizer_ft = optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
exp_lr_scheduler = lr_scheduler.StepLR(optimizer_ft, step_size=7, gamma=0.1)

def Tensor2Variable_CEL(input, label):
    input = Variable(input).cuda().float()
    label = Variable(label).cuda().long()
    return input, label

def train_model(model, criterion, optimizer, scheduler, num_epochs):
    for epoch in range(num_epochs):
        print('Epoch {}/{}'.format(epoch+1, num_epochs))
        print('-' * 10)
        since = time.time()

        for (input, label) in train_loader:
            # prepare data
            input, label = Tensor2Variable_CEL(input, label)
            # run the model
            output = model(input)
            loss = criterion(output, label)

        test_model(model, criterion)

        time_diff = time.time() - since
        print('epoch complete in {:0.6f}'.format(time_diff))

But what shocked me was that it’s much slower than when I don’t use nn.DataParallel(model), i.e. when I use just one GPU. Without model = nn.DataParallel(model), every epoch takes about 15 seconds, while with it every epoch takes about 30 seconds. Apart from adding model = nn.DataParallel(model), I didn’t change anything in my network or training process. Is there anything I did wrong? Thanks in advance.


Multi-GPU performance depends a lot on how much work each GPU has to do. Were you already fully using one GPU? How much data do you transfer to the GPUs when forwarding?
Multi-GPU comes at the cost of model synchronization. If your model is small, this cost can be larger than the time to perform the forward pass itself.
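
To make that trade-off concrete, here is a rough back-of-the-envelope sketch in plain Python. It assumes a deliberately simple cost model that is not taken from PyTorch itself: the compute part of an epoch splits evenly across GPUs, and a fixed overhead (model replication, scattering inputs, gathering outputs) is added on top. The function names are hypothetical and only for illustration; the 15 s and 30 s figures are the epoch times reported in the question, which also include the test pass.

```python
def parallel_epoch_time(single_gpu_time, num_gpus, overhead):
    """Toy estimate: compute splits evenly across GPUs, overhead is added on top."""
    return single_gpu_time / num_gpus + overhead


def implied_overhead(single_gpu_time, num_gpus, measured_parallel_time):
    """Overhead the toy model needs in order to explain a measured epoch time."""
    return measured_parallel_time - single_gpu_time / num_gpus


def max_overhead_for_speedup(single_gpu_time, num_gpus):
    """Largest overhead for which the toy model still predicts a speedup."""
    return single_gpu_time * (1 - 1 / num_gpus)


# Epoch times reported in the question (seconds).
single = 15.0
with_data_parallel = 30.0

overhead = implied_overhead(single, 2, with_data_parallel)
limit = max_overhead_for_speedup(single, 2)
print("implied overhead per epoch: {:.1f} s".format(overhead))
print("overhead must stay below {:.1f} s for 2 GPUs to help".format(limit))
```

Under these assumptions the reported numbers imply an overhead well above the break-even point, which would be consistent with a model too small to keep two GPUs busy. Real workloads rarely split this cleanly, so treat this as an intuition aid rather than a measurement.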


I had a similar problem while trying to train my network on two GPUs, but I ended up using only one GPU per model and using the other GPU to test out changes in my model and/or parameters.
Since my model was relatively small, I guess I was suffering from the cost of model synchronization mentioned by @albanD.