Lesser memory consumption with a larger batch in multi GPU setup

That’s an interesting observation @Vladimir_Iashin!
Thanks for the detailed analysis.
Besides the explanations in this thread, I’d like to dig a bit deeper.
Unfortunately I don’t have the same setup here, so could you disable cuDNN and re-run the training with the batch sizes that show the most obvious memory drops?
To disable cuDNN, set torch.backends.cudnn.enabled = False at the beginning of your training script.
Also these runs with

torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

would be interesting to see.
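For reference, a minimal sketch of how these toggles could sit at the top of the training script (the two options are just the configurations suggested above, not code from the original post):

import torch

# Option A: bypass cuDNN entirely and fall back to the native implementations
torch.backends.cudnn.enabled = False

# Option B: keep cuDNN, but force deterministic algorithms and keep benchmarking off
# torch.backends.cudnn.deterministic = True
# torch.backends.cudnn.benchmark = False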


Okay, here are the results of the experiments. Please note that .benchmark is disabled by default anyway, so there is no need to specify it.
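For reference, the defaults can be verified directly in a Python session:

import torch

print(torch.backends.cudnn.enabled)        # True by default
print(torch.backends.cudnn.benchmark)      # False by default
print(torch.backends.cudnn.deterministic)  # False by default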

  1. Default setup
  2. enabled = False and deterministic = True
  3. Only deterministic = True
  4. Only enabled = False
3x2080Ti (memory usage screenshots)

3xV100 (memory usage screenshots)

Code
import torch
import torchvision
import torchvision.transforms as transforms
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

## Exp. 1:
# torch.backends.cudnn.enabled = False
# torch.backends.cudnn.deterministic = True

## Exp. 2:
# torch.backends.cudnn.deterministic = True

## Exp. 3:
# torch.backends.cudnn.enabled = False

B = 4400  # global batch size; nn.DataParallel splits each batch across the GPUs

transform = transforms.Compose(
    [transforms.ToTensor(),
     transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])

trainset = torchvision.datasets.CIFAR10(root='./data', train=True,
                                        download=True, transform=transform)
trainloader = torch.utils.data.DataLoader(trainset, batch_size=B, shuffle=False)

classes = ('plane', 'car', 'bird', 'cat',
           'deer', 'dog', 'frog', 'horse', 'ship', 'truck')

cfg = {
    'VGG16': [64, 64, 'M', 128, 128, 'M', 256, 256, 256, 'M', 512, 512, 512, 'M', 512, 512, 512, 'M'],
}


class VGG(nn.Module):
    def __init__(self, vgg_name):
        super(VGG, self).__init__()
        self.features = self._make_layers(cfg[vgg_name])
        self.classifier = nn.Linear(512, 10)

    def forward(self, x):
        out = self.features(x)
        out = out.view(out.size(0), -1)
        out = self.classifier(out)
        return out

    def _make_layers(self, cfg):
        layers = []
        in_channels = 3
        for x in cfg:
            if x == 'M':
                layers += [nn.MaxPool2d(kernel_size=2, stride=2)]
            else:
                layers += [nn.Conv2d(in_channels, x, kernel_size=3, padding=1),
                           nn.BatchNorm2d(x),
                           nn.ReLU(inplace=True)]
                in_channels = x
        layers += [nn.AvgPool2d(kernel_size=1, stride=1)]
        return nn.Sequential(*layers)

net = VGG('VGG16')

criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(net.parameters(), lr=0.0001, momentum=0.9)

device = "cuda"
torch.cuda.set_device(0)
net.to(device)
net = nn.DataParallel(net, device_ids=[0, 1, 2])

for epoch in range(5):  # loop over the dataset multiple times

    running_loss = 0.0
    for i, data in enumerate(trainloader, 0):
        # get the inputs
        inputs, labels = data
        
        # `async` was renamed to `non_blocking` in newer PyTorch; Variable wrapping is no longer needed
        inputs, labels = inputs.cuda(device, non_blocking=True), labels.cuda(device, non_blocking=True)

        # zero the parameter gradients
        optimizer.zero_grad()

        # forward + backward + optimize
        outputs = net(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
    
    print('[{:d}, {:.5f}]'.format(epoch + 1, loss.item()))

Also, I had quite a fruitful chat with @ptrblck; here are some insightful things from it:

  1. pip3 install torch torchvision installs not only PyTorch itself but also the bundled binaries for CUDA and cuDNN. So, only the GPU driver is a prerequisite for PyTorch, at least for most modern GPUs;
  2. I noticed that during my installation I only added the path for CUDA and not for cuDNN. However, this does not matter because, apparently, PyTorch uses its own bundled binaries (see the previous note and the version check below).
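A quick way to confirm which bundled versions the installed PyTorch actually uses (a sanity-check sketch, not something run in the experiments above):

import torch

print(torch.__version__)                # PyTorch version
print(torch.version.cuda)               # CUDA version the binaries were built against
print(torch.backends.cudnn.version())   # cuDNN version shipped with the binaries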

At this point, I am using the torch.backends.cudnn.deterministic = True flag, which reduces memory consumption and, in my case, does not significantly slow down computation. However, memory usage still does not scale linearly with batch size.

The question is still open though.

Also, another piece of evidence: when the flag is False, the code throws a CUDA out-of-memory error at the beginning of the inference stage, even when I wrap it in a torch.no_grad() block.
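For context, this is roughly what such an inference stage looks like; a minimal sketch assuming a testloader built the same way as trainloader above (the accuracy bookkeeping is just an illustration, not the original evaluation code):

net.eval()
correct, total = 0, 0
with torch.no_grad():
    for inputs, labels in testloader:
        inputs = inputs.cuda(device, non_blocking=True)
        labels = labels.cuda(device, non_blocking=True)
        outputs = net(inputs)  # forward pass; the OOM was reported at the start of this loop with the flag set to False
        _, predicted = outputs.max(1)
        total += labels.size(0)
        correct += predicted.eq(labels).sum().item()
print('Test accuracy: {:.2f}%'.format(100.0 * correct / total))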

Hi @Vladimir_Iashin,
Would it be possible to know how you measured the GPU memory consumption profile?
Thanks in advance

Hi, I am using the nvtop tool. Basically, I run the code and watch the numbers. Please see the screenshots in “Previous version of the question” or the couple below:

Screenshots


I would be happy to do it automatically somehow, but I don’t know a tool with such functionality, so I did it manually.
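For a programmatic alternative, PyTorch itself exposes per-device memory counters; a minimal sketch of logging them after a training step (not what was used for the numbers above; it only tracks memory allocated through PyTorch’s caching allocator, so the values will be somewhat lower than what nvtop reports):

# Print the peak memory allocated by PyTorch on each visible GPU
for d in range(torch.cuda.device_count()):
    peak_mb = torch.cuda.max_memory_allocated(d) / 1024 ** 2
    print('GPU {}: peak allocated {:.1f} MB'.format(d, peak_mb))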
