Lower memory consumption with a larger batch in a multi-GPU setup

LAPACK is a way to efficiently slice and dice matrices and make better use of CPU memory for matrix computations. In the case of GPUs, PyTorch uses MAGMA (if your PyTorch was built with it enabled) to make better use of GPU memory through a LAPACK-style interface.

I installed PyTorch in my virtual environment using pip3 install torch torchvision, as the site says.

So I believe that your PyTorch is not using LAPACK at all to make better use of the memory on any of your devices.

Could you please be more specific with your responses?

What do you recommend I do, and can you provide some evidence for the things you are suggesting? For now, I do not understand whether I should try to install these. Why are these libraries not dependencies of PyTorch then? Why does the official Get Started page not mention any of these packages?

Sure… sorry…

In the same way, NumPy doesn’t tell you that if you want to make better use of your CPU by keeping 99.9% of the computation in the L1, L2, and L3 caches, you need to build, or grab an already built, NumPy package that uses LAPACK to do so.
This kind of thing is for people who know linear algebra and need to run a cluster at maximum power; it is not a mainstream concern. Publicly distributed software doesn’t come optimized for each machine, so you need to optimize it for your own resources.

MAGMA plays that role for PyTorch on GPUs.

Another thing: you will need to grab a pre-built PyTorch with MAGMA support to have access to this, or you will need to build your own PyTorch with this feature enabled.
I don’t use Anaconda, so I cannot tell you for sure whether there is a pre-built package that detects whether you have MAGMA installed.

If you want to keep using PyTorch as a generic package, then stop worrying about maximum performance or questioning GPU memory consumption, because that is not something a stock setup is tuned for.
If, on the other hand, you want to improve memory usage and have a package that speeds up your training and makes better use of your GPUs and their memory, then you will need to start studying that too.
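
As a side note, one quick check (assuming a PyTorch version that exposes this flag) of whether the installed binary was built with MAGMA at all:

import torch

torch.cuda.init()              # make sure the CUDA state is initialized
print(torch.cuda.has_magma)    # True only if this build was compiled against MAGMA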

Isn’t that related to what we’ve discussed in Bug in DataParallel? Only works if the dataset device is cuda:0 - #12 by rasbt? I.e., that the intermediate results are gathered on one of the devices?

(figure from the linked DataParallel discussion, illustrating how the outputs are gathered on the first device)

If the batch size is larger, there will be more output to gather, which could explain why the difference becomes more pronounced as you increase the batch size.
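
A minimal sketch of this (assuming a machine with at least two GPUs; the Linear model is just a stand-in): nn.DataParallel scatters the input batch across the devices, but the full, gathered output ends up on device_ids[0].

import torch
import torch.nn as nn

# stand-in model; any module would show the same behaviour
model = nn.DataParallel(nn.Linear(512, 10).cuda(0), device_ids=[0, 1])

x = torch.randn(256, 512, device='cuda:0')  # the batch of 256 is scattered as 128 + 128
out = model(x)

print(out.shape)   # torch.Size([256, 10]) -- the whole batch ...
print(out.device)  # cuda:0                -- ... gathered back on the first GPU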

@cyberwillis Why are you assuming that I am using Anaconda?

Are you experiencing the same problem or not? Have you tried to replicate my results, or at least run the code on a machine with multiple GPUs? Is this well-known PyTorch behaviour, or does it fail only in my case?

I hate myself for saying this, but I have not seen any reason why I should try to build PyTorch from source with MAGMA enabled.

Hi

I did not mean to imply that you are using Anaconda. In fact, almost all packages in the pip repository are generic, and you will not find any optimized packages there (just generic builds).

I don’t experience the same problem as you, because I build my own software and optimize it for my resources.

I don’t even need to set up a box with exactly the same resources as yours to replicate your problem, because I understand that you are just using the generic packages for all the pieces (CUDA, cuDNN, PyTorch) and complaining about memory inefficiency instead of looking into how to optimize them.

About the question you hate: first try to understand what LAPACK does and why MAGMA was created.
Computational performance and memory efficiency depend on it.

Your main question is not a problem with PyTorch. It’s a problem of a non-optimized setup.

Hi @rasbt

If you get some time, see if you can help him with this.

Maybe you are seeing things I am not seeing.

What I was trying to say was that this looks like expected behavior to me. The first GPU should require and utilize more memory because that’s where the gathering of the outputs and the computation of the loss happen, as outlined in the graphic I shared from the other discussion.

The results the OP showed indicate that the 2nd and 3rd GPUs have evenly distributed memory usage, which is exactly what I would expect.

First thing:

I have made a couple of plots to illustrate the behaviour that I cannot explain. I hope to get a clearer picture with your help.

Plots

Why is it expected for you? Please elaborate on why you expected a non-positive trend in memory usage as the batch size increases. Only GPU 0 of the V100 setup has a positive trend, as expected.

Second thing:

Earlier I encountered rather unusual behaviour of my 3x2080Ti setup compared to the 3x1080Ti one. I had the same code, data, and set of deep learning libraries.

With 3x1080Ti I could feed a batch of size 3x85 (255), and memory was allocated evenly (a bit more on GPU 0, as expected, but not by much). The problem occurred when I tried to fit the same batch on 3x2080Ti. Of course, a 2080Ti has less memory, but only 3x50 fit well, and the memory allocation was unbalanced, similar to this situation. Therefore, I started to think that the non-zero GPUs actually report a proper amount of memory and the usage on GPU 0 is inflated.

Updated the initial question by adding more clarification and evidence.

That’s an interesting observation @Vladimir_Iashin!
Thanks for the detailed analysis.
Besides the explanations in this thread, I’d like to dig a bit deeper.
Unfortunately I don’t have the same setup here, so could you try to disable cuDNN for your experiments and try to re-run the training using batch sizes with the most obvious memory drops?
To disable cuDNN, just use torch.backends.cudnn.enabled = False at the beginning of your training.
Also, these runs with

torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

would be interesting to see.
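
For reference, these are the three cuDNN switches together, set at the top of the script (a sketch; the comments describe what each flag controls):

import torch

torch.backends.cudnn.enabled = False        # bypass cuDNN entirely and fall back to native kernels
torch.backends.cudnn.deterministic = True   # restrict cuDNN to deterministic algorithms
torch.backends.cudnn.benchmark = False      # don't benchmark conv algorithms per input size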


Okay, here are the results of the experiments. Please note that .benchmark is disabled by default anyway, so there is no need to specify it.

  1. Default setup
  2. enabled = False and deterministic = True
  3. Only deterministic = True
  4. Only enabled = False
3x2080Ti

3xV100

Code
import torch
import torchvision
import torchvision.transforms as transforms
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

## Exp. 1 is the default setup (all flags left at their defaults).

## Exp. 2: enabled = False and deterministic = True
# torch.backends.cudnn.enabled = False
# torch.backends.cudnn.deterministic = True

## Exp. 3: only deterministic = True
# torch.backends.cudnn.deterministic = True

## Exp. 4: only enabled = False
# torch.backends.cudnn.enabled = False

B = 4400

transform = transforms.Compose(
    [transforms.ToTensor(),
     transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])

trainset = torchvision.datasets.CIFAR10(root='./data', train=True,
                                        download=True, transform=transform)
trainloader = torch.utils.data.DataLoader(trainset, batch_size=B, shuffle=False)

classes = ('plane', 'car', 'bird', 'cat',
           'deer', 'dog', 'frog', 'horse', 'ship', 'truck')

cfg = {
    'VGG16': [64, 64, 'M', 128, 128, 'M', 256, 256, 256, 'M', 512, 512, 512, 'M', 512, 512, 512, 'M'],
}


class VGG(nn.Module):
    def __init__(self, vgg_name):
        super(VGG, self).__init__()
        self.features = self._make_layers(cfg[vgg_name])
        self.classifier = nn.Linear(512, 10)

    def forward(self, x):
        out = self.features(x)
        out = out.view(out.size(0), -1)
        out = self.classifier(out)
        return out

    def _make_layers(self, cfg):
        layers = []
        in_channels = 3
        for x in cfg:
            if x == 'M':
                layers += [nn.MaxPool2d(kernel_size=2, stride=2)]
            else:
                layers += [nn.Conv2d(in_channels, x, kernel_size=3, padding=1),
                           nn.BatchNorm2d(x),
                           nn.ReLU(inplace=True)]
                in_channels = x
        layers += [nn.AvgPool2d(kernel_size=1, stride=1)]
        return nn.Sequential(*layers)

net = VGG('VGG16')

criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(net.parameters(), lr=0.0001, momentum=0.9)

device = "cuda"
torch.cuda.set_device(0)
net.to(device)
net = nn.DataParallel(net, device_ids=[0, 1, 2])

for epoch in range(5):  # loop over the dataset multiple times

    running_loss = 0.0
    for i, data in enumerate(trainloader, 0):
        # get the inputs
        inputs, labels = data
        
        # non_blocking replaces the deprecated `async` argument (`async` is a reserved keyword in Python 3.7+);
        # the Variable wrapper is no longer needed since PyTorch 0.4
        inputs, labels = inputs.cuda(device, non_blocking=True), labels.cuda(device, non_blocking=True)

        # zero the parameter gradients
        optimizer.zero_grad()

        # forward + backward + optimize
        outputs = net(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
    
    print('[{:d}, {:5f}]'.format(epoch+1, loss.item()))

Also, I have had quite a fruitful chat with @ptrblck; these are some insights from it:

  1. pip3 install torch torchvision installs not only PyTorch but also the binaries it needs, such as a proper CUDA runtime and cuDNN. So only the GPU driver is a prerequisite for PyTorch, at least for most modern GPUs (see the snippet after this list for how to check which versions the binary ships with);
  2. I noticed that during my installation I only added the path for CUDA and did not for cuDNN. However, this does not matter because, apparently, PyTorch uses its own bundled binaries (see the previous note).
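
A sketch of how to check this from Python (these attributes report what the installed binary was built with, not what is on the system PATH):

import torch

print(torch.__version__)               # PyTorch version
print(torch.version.cuda)              # CUDA version the binary was built with
print(torch.backends.cudnn.version())  # bundled cuDNN version
print(torch.cuda.is_available())       # True if the installed driver is sufficient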

At this point, I am using the torch.backends.cudnn.deterministic = True flag, which reduces memory consumption and, in my case, does not significantly slow down the computations. However, I am still seeing the non-linear behaviour as the batch size increases.

The question is still open though.

Also, as further evidence: when the flag is False, the code throws a CUDA out of memory error at the beginning of the inference stage, even when I wrap it in a with torch.no_grad() block.
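
For context, a minimal sketch of that inference stage (testloader is assumed here, built analogously to trainloader in the code above):

net.eval()
with torch.no_grad():
    for inputs, labels in testloader:  # testloader assumed, analogous to trainloader
        inputs = inputs.cuda(device, non_blocking=True)
        outputs = net(inputs)          # no gradients are recorded inside no_grad()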

Hi @Vladimir_Iashin,
Would it be possible to know how you produced the GPU memory consumption profile?
Thanks in advance

Hi, I am using the nvtop tool. Basically, I run the code and look at the numbers. Please see the screenshots in “Previous version of the question” or the couple below:

Screenshots

(two nvtop screenshots)

I would be happy if I could do it automatically somehow, but I don’t know a tool with such functionality, so I did it manually.
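
One possible way to log it automatically from inside the script (a sketch, not what was used for the plots above; the log_gpu_memory helper is made up here, and the numbers come from PyTorch's caching allocator, so they will not match nvtop exactly):

import torch

def log_gpu_memory(tag=''):
    # print currently allocated and peak allocated memory for every visible GPU
    for d in range(torch.cuda.device_count()):
        alloc = torch.cuda.memory_allocated(d) / 1024**2
        peak = torch.cuda.max_memory_allocated(d) / 1024**2
        print('{} cuda:{}: {:.0f} MiB allocated, {:.0f} MiB peak'.format(tag, d, alloc, peak))

# e.g. call log_gpu_memory('after step') at the end of each training iteration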
