High GPU Memory Demand for PyTorch?

Hi,

I have a variant of the Inception v3 network, which is pre-trained and fixed. Now I want to run inference with that network.

I tried both tensorflow and pytorch.

TensorFlow: I saved the network as a GraphDef and use it for inference. I can use a batch size of 128.
PyTorch: I put the model in evaluation mode, but a batch size of 128 gives me an out-of-memory error.

I don’t know why this happens. My guess is that TensorFlow may not cache the intermediate feature maps in GraphDef mode, but PyTorch may.

I have also had other problems related to memory usage. My recent experience suggests that PyTorch often uses more GPU memory than TensorFlow. In many cases I have to reduce the batch size, which may or may not solve my problem. Are there any suggestions on how to use PyTorch more memory-efficiently?

7 Likes

You should make sure your input Variables have volatile=True. That tells torch.autograd not to build a graph for backpropagation and that it can free unneeded tensors during the forward pass.
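For example, something along these lines (a minimal sketch using the old Variable/volatile API; model and the 299x299 input shape are placeholders for your Inception variant):

import torch
from torch.autograd import Variable

model.eval()                                  # switch dropout/batchnorm to eval mode
batch = torch.randn(128, 3, 299, 299).cuda()  # placeholder input batch
iv = Variable(batch, volatile=True)           # no graph is built; buffers can be freed
output = model(iv)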

3 Likes

I’d recommend reading the autograd notes.

Also, you might want to try torch.backends.cudnn.benchmark = True to improve the speed.

2 Likes

Thanks for the pointer. I think my problem might be a little different.

I have two networks, A and B: the output of A is fed to network B, and the loss is defined on top of network B. Network B is fixed and network A is to be optimized.

So I need network B to back-propagate the gradients from the loss to A, but these gradients do not need to be saved, since network B doesn’t need to be updated. In this case, can I set all Variables in B to volatile=True?

Is there some way for me to know whether the gradients of the parameters in B are buffered or not?

1 Like

In your particular case, you need to set B’s parameters to not require gradients.
Do this:

# let's assume input is a Tensor, A and B are networks
import torch.optim as optim
from torch.autograd import Variable

optimizerA = optim.SGD(A.parameters(), ...)

# freeze B's params so autograd doesn't accumulate gradients for them
for p in B.parameters():
    p.requires_grad = False

iv = Variable(input)
optimizerA.zero_grad()
out1 = A(iv)
out2 = B(out1)      # out2 should be the scalar loss defined on top of B
out2.backward()     # gradients still flow through B back into A
optimizerA.step()
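To answer the earlier question about whether B’s gradients get buffered: one way to check (a small sketch, assuming A and B as above and that no earlier backward pass ever populated B’s gradients) is to look at the .grad attributes after backward():

# frozen parameters should have no gradient buffers allocated
assert all(p.grad is None for p in B.parameters())
# A's parameters should now carry gradients
assert all(p.grad is not None for p in A.parameters())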
1 Like

Thanks. This is what I did. One quick question: if I set requires_grad = False, does it save extra GPU memory for me?

Yes, it saves memory in certain places.
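If you want to measure it yourself, a rough sketch (assuming a PyTorch build that exposes torch.cuda.memory_allocated(); older versions may not have it, and A, B, iv are the objects from above) looks like this:

import torch

torch.cuda.synchronize()
before = torch.cuda.memory_allocated()      # bytes currently held by tensors

out = B(A(iv))                              # the pass you want to profile
out.backward()

torch.cuda.synchronize()
after = torch.cuda.memory_allocated()
print('allocated delta: %.1f MB' % ((after - before) / 1024.0 ** 2))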

OK, I feel a little frustrated. I did everything I could to reduce GPU memory, but my PyTorch code still uses more than twice as much memory as my TensorFlow implementation. Because the code is confidential, I cannot release it publicly.

My network B is an Inception v3 network. I found that after I pass the output of A to B, the GPU memory usage increases by 4 GB. The batch size is only about 20.

Without looking further I unfortunately can’t comment further :frowning: but we’ve benchmarked memory usage on ResNets, AlexNet, etc. and we’re on par with or better than other frameworks.

If you can give a small script that showcases the problem (without giving out your actual code), I’m happy to take a look.

Is the tensor data type the same in PyTorch and TensorFlow?
TensorFlow typically works in float32, while PyTorch provides both float32 and float64 (double).
If you use float64 in PyTorch, it is normal for the allocated memory to be twice that of TensorFlow.
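A quick sanity check along these lines may help (a sketch; model and inputs stand for whatever network and batch you actually load):

# check what precision the parameters actually use
for name, p in model.named_parameters():
    print(name, p.data.type())      # e.g. torch.FloatTensor vs torch.DoubleTensor

# force single precision to match TensorFlow's usual float32 default
model = model.float().cuda()
inputs = inputs.float()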

@gaoking132 If you can’t share the model but can share the training loop, that might still help. Maybe you have some small bug there.

In the finetuning example, TensorFlow can use more than double the batch size of PyTorch with a resnet152 model on the same machine.
I don’t know exactly why, but I saw the PyTorch error occurring in “input.new()” in batch norm, so I guess the two frameworks handle intermediate buffers differently.
If TensorFlow allocates intermediate variables in CPU memory, the upside would be memory efficiency, but the downside would be a slowdown from frequent CUDA memory copies.

Thank you for all the helpful replies.

The code for both the PyTorch and TensorFlow implementations is a little complex, so it certainly cannot serve as a fair comparison. I don’t have much time to extract a small part of the code for debugging purposes. Sorry for not providing more helpful feedback on this thread.

This is my test code for comparing PyTorch and TensorFlow.

Below is a PyTorch version of the flower fine-tuning example.
On my machine (GTX 980 Ti), a batch size of 8 works in PyTorch, but 16 does not.

import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision
from torchvision import datasets, transforms
from torch.autograd import Variable
import matplotlib.pyplot as plt
import numpy as np

is_cuda = torch.cuda.is_available() # True if CUDA is available
traindir = './flower_photos'

normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])

batch_size = 16 
train_loader = torch.utils.data.DataLoader(
    datasets.ImageFolder(traindir,
                         transforms.Compose([
                             transforms.RandomSizedCrop(224),
                             transforms.RandomHorizontalFlip(),
                             transforms.ToTensor(),
                             normalize,])),
    batch_size=batch_size,
    shuffle=True,
    num_workers=4)

cls_num = len(datasets.folder.find_classes(traindir)[0])

test_loader = torch.utils.data.DataLoader(
    datasets.ImageFolder(traindir,
                         transforms.Compose([
                             transforms.RandomSizedCrop(224),
                             transforms.RandomHorizontalFlip(),
                             transforms.ToTensor(),
                             normalize,])),
    batch_size=batch_size,
    shuffle=True,
    num_workers=1)

model = torchvision.models.resnet152(pretrained = True)

### don't update model parameters
for param in model.parameters() :
    param.requires_grad = False
# modify the last fully connected layer
model.fc = nn.Linear(model.fc.in_features, cls_num)

fc_parameters = [
    {'params': model.fc.parameters()},
]
optimizer = torch.optim.Adam(fc_parameters, lr=1e-4, weight_decay=1e-4)
loss_fn = nn.CrossEntropyLoss() 

if is_cuda:
    model.cuda()
    loss_fn.cuda()

# training

model.train()
train_loss = []
train_accu = []
i = 0
for epoch in range(1):
    for image, target in train_loader:
        image, target = Variable(image.float()), Variable(target) 
        if is_cuda :  image, target = image.cuda(), target.cuda() 
        output = model(image) 
        loss = loss_fn(output, target) 
        optimizer.zero_grad() 
        loss.backward() 
        optimizer.step() 
        
        pred = output.data.max(1)[1]
        accuracy = pred.eq(target.data).sum()/batch_size
        
        train_loss.append(loss.data[0])
        train_accu.append(accuracy)

        if i % 300 == 0:
            print(i, loss.data[0])
        i += 1

Below is the TensorFlow-Slim implementation:

images, _, labels = load_batch(dataset,  batch_size=256, height=image_size, width=image_size)

With the same network model on the same machine (GTX 980 Ti), a batch size of 256 works in TensorFlow, but 512 does not.

Thanks! We’ll look into that!

I checked it out and there was indeed a problem in autograd. We weren’t freeing some buffers soon enough. I’ve opened a PR with a fix. After that commit, batch size 256 uses the same amount of memory (4.7 GB on my GPU) as batch size 16 did before.

Thanks for posting the repro!

3 Likes

Thank you for a prompt response!
I’ll love pytorch more than yesterday. :slight_smile:

2 Likes

Thanks, it also solves my problem.

1 Like

Hello! Is this supposed to work with torch.no_grad()? With torch.no_grad(): my model’s inference consumes the same memory as in training… Is that normal?
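For reference, in current PyTorch the inference path would look roughly like this (a minimal sketch; model and batch are placeholders):

model.eval()                        # eval behaviour for dropout/batchnorm
with torch.no_grad():               # no graph is recorded, activations can be freed
    output = model(batch.cuda())

If memory still looks the same as in training, make sure every forward pass really runs inside the no_grad block, and note that nvidia-smi reports the caching allocator’s reserved memory, which does not shrink when individual tensors are freed.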