Multi-GPU training, imbalanced GPU memory, and what to do with the loss function

I have been using PyTorch for a long time, but I still cannot find a clear solution to the problem of multi-GPU training.
Can someone please help me out?
When I have multiple GPUs and a large batch size, I do the following:
net = nn.DataParallel(net)
and it simply makes my model run in parallel. Nice!
But what should I do for the optimization part? I noticed a few things while using multiple GPUs that were not there when I had just one GPU:

  1. the GPU memory usage is not balanced; usually one of the GPUs uses much more memory than the others
  2. the performance drops! Why? I don't know

I think I am still not clear on what I should do with the criterion and loss function when I am using multiple GPUs.
I see PyTorch has added a few more tutorials, but they are not helping me.
I am an example person; I understand things when I see them in an example.
Here, I copy-paste the example provided by PyTorch for training a classifier.
I modified it to run on multiple GPUs.
But I still have the same issues.
Can someone tell me what I am missing and how I should fix it?
Thanks a lot!

import torch
import torchvision
import torchvision.transforms as transforms

transform = transforms.Compose(
    [transforms.ToTensor(),
     transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])

trainset = torchvision.datasets.CIFAR10(root='./data', train=True,
                                        download=True, transform=transform)
trainloader = torch.utils.data.DataLoader(trainset, batch_size=4,
                                          shuffle=True, num_workers=2)

testset = torchvision.datasets.CIFAR10(root='./data', train=False,
                                       download=True, transform=transform)
testloader = torch.utils.data.DataLoader(testset, batch_size=4,
                                         shuffle=False, num_workers=2)

classes = ('plane', 'car', 'bird', 'cat',
           'deer', 'dog', 'frog', 'horse', 'ship', 'truck')


import torch.nn as nn
import torch.nn.functional as F


class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(3, 6, 5)
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.fc1 = nn.Linear(16 * 5 * 5, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = x.view(-1, 16 * 5 * 5)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x


net = Net().cuda()
net = nn.DataParallel(net)

import torch.optim as optim

criterion = nn.CrossEntropyLoss().cuda()
optimizer = optim.SGD(net.parameters(), lr=0.001, momentum=0.9)


for epoch in range(2):  # loop over the dataset multiple times

    running_loss = 0.0
    for i, data in enumerate(trainloader, 0):
        # get the inputs; data is a list of [inputs, labels]
        inputs, labels = data 
        inputs, labels = inputs.cuda(), labels.cuda()

        # zero the parameter gradients
        optimizer.zero_grad()

        # forward + backward + optimize
        outputs = net(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()

        # print statistics
        running_loss += loss.item()
        if i % 2000 == 1999:    # print every 2000 mini-batches
            print('[%d, %5d] loss: %.3f' %
                  (epoch + 1, i + 1, running_loss / 2000))
            running_loss = 0.0

print('Finished Training')
  1. Have a look at @Thomas_Wolf’s blog post, where he explains the memory usage of the different devices and how to reduce the imbalance.

  2. Do you mean the performance regarding the model accuracy or throughput?


@ptrblck

  1. I will take a look.
  2. Yes, I mean in terms of accuracy.

In that blog I saw that if I use DataParallel I should call loss.mean().backward(). Would it be wrong if I did loss.backward() instead?

I followed the guide in @Thomas_Wolf's blog, but it gave me an error.

Here is my code:


import torch
import torchvision
import torchvision.transforms as transforms
from parallel import DataParallelModel, DataParallelCriterion

transform = transforms.Compose(
    [transforms.ToTensor(),
     transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])

trainset = torchvision.datasets.CIFAR10(root='./data', train=True,
                                        download=True, transform=transform)
trainloader = torch.utils.data.DataLoader(trainset, batch_size=4,
                                          shuffle=True, num_workers=2)

testset = torchvision.datasets.CIFAR10(root='./data', train=False,
                                       download=True, transform=transform)
testloader = torch.utils.data.DataLoader(testset, batch_size=4,
                                         shuffle=False, num_workers=2)

classes = ('plane', 'car', 'bird', 'cat',
           'deer', 'dog', 'frog', 'horse', 'ship', 'truck')


import torch.nn as nn
import torch.nn.functional as F


class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(3, 6, 5)
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.fc1 = nn.Linear(16 * 5 * 5, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = x.view(-1, 16 * 5 * 5)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x


net = Net().cuda()
parallel_model = DataParallelModel(net)

import torch.optim as optim

criterion = nn.CrossEntropyLoss().cuda()
optimizer = optim.SGD(net.parameters(), lr=0.001, momentum=0.9)

parallel_loss = DataParallelCriterion(criterion)

for epoch in range(2):  # loop over the dataset multiple times

    running_loss = 0.0
    for i, data in enumerate(trainloader, 0):
        # get the inputs; data is a list of [inputs, labels]
        inputs, labels = data 
        inputs, labels = inputs.cuda(), labels.cuda()

        # zero the parameter gradients
        optimizer.zero_grad()

        # forward + backward + optimize
        outputs = parallel_model(inputs)
        loss = parallel_loss(outputs, labels)
        loss.sum().backward()
        optimizer.step()

        # print statistics
        running_loss += loss.item()
        if i % 2000 == 1999:    # print every 2000 mini-batches
            print('[%d, %5d] loss: %.3f' %
                  (epoch + 1, i + 1, running_loss / 2000))
            running_loss = 0.0

print('Finished Training')

and I got this error:

ValueError: only one element tensors can be converted to Python scalars

@ptrblck Hello, I read the blog and ZhangHang’s code.
Does PyTorch support computing the loss in a parallel fashion now?

The error most likely comes from

running_loss += loss.item()

since loss doesn’t seem to be a scalar.
You could reduce the loss to a scalar first, e.g. by taking its mean or sum.
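As a rough sketch, reusing the names from the DataParallelCriterion code posted above, where the loss appears to hold one value per GPU, the training step could reduce the loss before logging it; taking the mean is one option, the sum would work as well:

outputs = parallel_model(inputs)
loss = parallel_loss(outputs, labels)  # tensor with one entry per GPU

loss = loss.mean()   # or loss.sum(); reduces the per-GPU losses to a scalar
loss.backward()
optimizer.step()

running_loss += loss.item()  # works now, since loss is a single-element tensor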

@PistonY These scripts from @Thomas_Wolf provide this functionality.

@ptrblck Should I get the same performance for single-GPU and multi-GPU training?

Hmmm, I am not sure what it means to reduce the loss… (?)
When I print my loss it looks like this: [1.255, 54.2] (I have two GPUs in this case).
@Thomas_Wolf can you help, please?

Ideally your performance should scale with the number of GPUs.
What kind of performance do you measure using multiple GPUs vs. a single one?

You could calculate the mean or sum of the loss and call backward() on the result.

It was a classification task; the performance got worse as I added GPUs, which did not make sense to me.

I did do that; in the code I posted I used loss.sum().backward(). Isn't that what you mean?

I still cannot make it work :confused:

I have a similar problem, which is as follows:

I have a U-Net structure into which I try to feed back some feature maps, and I am using multiple GPUs.
When I try to run my code on only one GPU it throws this error:

RuntimeError: Trying to backward through the graph a second time, but the buffers have already been freed. Specify retain_graph=True when calling backward the first time.

but when I run it on two GPUs it goes fine.
Here is how I use the backward call:

curr_loss = self.criterion(smax_outputs, label)

if self.gpustat['n_gpus'] >= 2:
    # per-device batch size, used as the length of the gradient passed to backward()
    Dev_Inc = max(self.cfg['TRAIN_BATCH_SIZE'] // self.gpustat['n_gpus'], 1)
    idx = torch.ones(Dev_Inc).cuda()
    curr_loss.backward(idx)  # curr_loss is not a scalar here, so a gradient tensor is supplied
else:
    curr_loss.backward()

self.optimizer.step()
# zero the parameter gradients
self.optimizer.zero_grad()

It is not a memory problem…

Unfortunately, the root cause of this error is not visible in this code snippet.
Could you post a minimal, executable code snippet that reproduces this error with your model and random data, so that we can have a look?
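For reference, a minimal sketch of what such a reproduction script could look like, assuming a tiny placeholder model and random tensors (TinyNet and the shapes below are stand-ins that would need to be replaced by the actual U-Net, its feedback connections, and the real data shapes that trigger the error):

import torch
import torch.nn as nn

# tiny stand-in model; replace with the actual U-Net that triggers the error
class TinyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 10, 3, padding=1)

    def forward(self, x):
        return self.conv(x)

model = TinyNet().cuda()
criterion = nn.CrossEntropyLoss().cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# random data standing in for the real dataset
inputs = torch.randn(2, 3, 64, 64).cuda()
targets = torch.randint(0, 10, (2, 64, 64)).cuda()

for _ in range(3):
    optimizer.zero_grad()
    outputs = model(inputs)
    loss = criterion(outputs, targets)
    loss.backward()
    optimizer.step()

print('ran without errors')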