High GPU Memory Demand for PyTorch?

Hi,

I have a variant of the Inception v3 network, which is pre-trained and fixed. Now I want to run inference with that network.

I tried both tensorflow and pytorch.

TensorFlow: I saved the network as a GraphDef and use it for inference. I can use a batch size of 128.
PyTorch: I put the model in evaluation mode, but a batch size of 128 gives me an out-of-memory error.

I don’t know why this happens. My guess is that TensorFlow may not cache the intermediate feature maps in GraphDef mode, but PyTorch may.

I have also had other problems related to memory usage. My recent experience suggests that PyTorch often uses more GPU memory than TensorFlow. In many cases I have to reduce the batch size, which may or may not solve my problem. Are there any suggestions on how to use PyTorch more memory-efficiently?

7 Likes

You should make sure your input Variables have volatile=True. That tells torch.autograd not to build a graph for backpropagation and that it can free unneeded tensors during the forward pass.
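For example, something along these lines (a minimal sketch using the old Variable/volatile API; model and the 299x299 input shape are placeholders for your Inception variant):

import torch
from torch.autograd import Variable

model.eval()                                  # switch dropout/batchnorm to eval mode
batch = torch.randn(128, 3, 299, 299).cuda()  # placeholder input batch
iv = Variable(batch, volatile=True)           # no graph is built; buffers can be freed
output = model(iv)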

3 Likes

I’d recommend reading the autograd notes.

Also, you might want to try torch.backends.cudnn.benchmark = True to improve the speed.

2 Likes

Thanks for the pointer. I think my problem might be a little different.

I have two networks, A and B: the output of A is fed to network B, and the loss is defined on top of network B. Network B is fixed and network A is to be optimized.

So I need network B to back-propagate the gradients from the loss to A, but these gradients do not need to be saved, since network B doesn’t need to be updated. In this case, can I set all Variables in B to volatile=True?

Is there some way for me to know whether the gradients of the parameters in B are buffered or not?

1 Like

In your particular case, you need to set B’s parameters to not require gradients.
Do this:

# let's assume input is a Tensor, A and B are networks
import torch.optim as optim
from torch.autograd import Variable

optimizerA = optim.SGD(A.parameters(), ...)

# freeze B's params so autograd doesn't accumulate gradients for them
for p in B.parameters():
    p.requires_grad = False

iv = Variable(input)
optimizerA.zero_grad()
out1 = A(iv)
out2 = B(out1)      # out2 should be the scalar loss defined on top of B
out2.backward()     # gradients still flow through B back into A
optimizerA.step()
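To answer the earlier question about whether B’s gradients get buffered: one way to check (a small sketch, assuming A and B as above and that no earlier backward pass ever populated B’s gradients) is to look at the .grad attributes after backward():

# frozen parameters should have no gradient buffers allocated
assert all(p.grad is None for p in B.parameters())
# A's parameters should now carry gradients
assert all(p.grad is not None for p in A.parameters())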
1 Like

Thanks. This is what I did. One quick question: if I set requires_grad = False, does it save extra GPU memory for me?

Yes, it saves memory in certain places.
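If you want to measure it yourself, a rough sketch (assuming a PyTorch build that exposes torch.cuda.memory_allocated(); older versions may not have it, and A, B, iv are the objects from above) looks like this:

import torch

torch.cuda.synchronize()
before = torch.cuda.memory_allocated()      # bytes currently held by tensors

out = B(A(iv))                              # the pass you want to profile
out.backward()

torch.cuda.synchronize()
after = torch.cuda.memory_allocated()
print('allocated delta: %.1f MB' % ((after - before) / 1024.0 ** 2))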

OK, I feel a little frustrated. I did everything I could to reduce GPU memory, but my PyTorch code still uses more than twice as much memory as my TensorFlow implementation. Because the code is confidential, I cannot release it publicly.

My network B is an Inception v3 network. I found that after I pass the output of A to B, the GPU memory usage increases by 4 GB. The batch size is only about 20.

Without looking further I unfortunately can’t comment further :frowning: but we’ve benchmarked memory usage on ResNets, AlexNet, etc. and we’re on par with or better than other frameworks.

If you can give a small script that showcases the problem (without giving out your actual code), I’m happy to take a look.

Is the tensor data type the same in PyTorch and TensorFlow?
TensorFlow typically works in float32, while PyTorch provides both float32 and float64 (double).
If you use float64 in PyTorch, it is normal for the allocated memory to be twice that of TensorFlow.
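A quick sanity check along these lines may help (a sketch; model and inputs stand for whatever network and batch you actually load):

# check what precision the parameters actually use
for name, p in model.named_parameters():
    print(name, p.data.type())      # e.g. torch.FloatTensor vs torch.DoubleTensor

# force single precision to match TensorFlow's usual float32 default
model = model.float().cuda()
inputs = inputs.float()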

@gaoking132 If you can’t share the model but can share the training loop, that might still help. Maybe you have some small bug there.

In the finetuning example, TensorFlow can use more than double the batch size of PyTorch with a resnet152 model on the same machine.
I don’t know exactly why, but I saw the PyTorch error occurring in “input.new()” in batch norm, so I guess the two frameworks handle intermediate buffers differently.
If TensorFlow allocates intermediate variables in CPU memory, the upside would be memory efficiency, but the downside would be a slowdown from frequent CUDA memory copies.

Thank you for all the helpful replies.

The code for both the PyTorch and TensorFlow implementations is a little complex, so it certainly cannot serve as a fair comparison. I don’t have much time to extract a small part of the code for debugging purposes. Sorry for not providing more helpful feedback on this thread.

This is my test code for comparing PyTorch and TensorFlow.

Below is a PyTorch version of the flower fine-tuning example.
On my machine (GTX 980 Ti), a batch size of 8 works in PyTorch, but 16 does not.

import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision
from torchvision import datasets, transforms
from torch.autograd import Variable
import matplotlib.pyplot as plt
import numpy as np

is_cuda = torch.cuda.is_available() # True if CUDA is available
traindir = './flower_photos'

normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])

batch_size = 16 
train_loader = torch.utils.data.DataLoader(
    datasets.ImageFolder(traindir,
                         transforms.Compose([
                             transforms.RandomSizedCrop(224),
                             transforms.RandomHorizontalFlip(),
                             transforms.ToTensor(),
                             normalize,])),
    batch_size=batch_size,
    shuffle=True,
    num_workers=4)

cls_num = len(datasets.folder.find_classes(traindir)[0])

test_loader = torch.utils.data.DataLoader(
    datasets.ImageFolder(traindir,
                         transforms.Compose([
                             transforms.RandomSizedCrop(224),
                             transforms.RandomHorizontalFlip(),
                             transforms.ToTensor(),
                             normalize,])),
    batch_size=batch_size,
    shuffle=True,
    num_workers=1)

model = torchvision.models.resnet152(pretrained = True)

### don't update model parameters
for param in model.parameters() :
    param.requires_grad = False
# modify the last fully connected layer
model.fc = nn.Linear(model.fc.in_features, cls_num)

fc_parameters = [
    {'params': model.fc.parameters()},
]
optimizer = torch.optim.Adam(fc_parameters, lr=1e-4, weight_decay=1e-4)
loss_fn = nn.CrossEntropyLoss() 

if is_cuda:
    model.cuda()
    loss_fn.cuda()

# training

model.train()
train_loss = []
train_accu = []
i = 0
for epoch in range(1):
    for image, target in train_loader:
        image, target = Variable(image.float()), Variable(target) 
        if is_cuda :  image, target = image.cuda(), target.cuda() 
        output = model(image) 
        loss = loss_fn(output, target) 
        optimizer.zero_grad() 
        loss.backward() 
        optimizer.step() 
        
        pred = output.data.max(1)[1]
        accuracy = pred.eq(target.data).sum()/batch_size
        
        train_loss.append(loss.data[0])
        train_accu.append(accuracy)

        if i % 300 == 0:
            print(i, loss.data[0])
        i += 1

Below is the TensorFlow-Slim implementation:

images, _, labels = load_batch(dataset,  batch_size=256, height=image_size, width=image_size)

With the same network model on the same machine (GTX 980 Ti), a batch size of 256 works in TensorFlow, but 512 does not.

Thanks! We’ll look into that!

I checked it out and there was indeed a problem in autograd. We weren’t freeing some buffers soon enough. I’ve opened a PR with a fix. After that commit, batch size 256 uses the same amount of memory (4.7 GB on my GPU) as batch size 16 did before.

Thanks for posting the repro!

3 Likes

Thank you for a prompt response!
I’ll love pytorch more than yesterday. :slight_smile:

2 Likes

Thanks, it also solves my problem.

1 Like

Hello! Is this supposed to work with torch.no_grad()? With torch.no_grad(): my model’s inference consumes the same memory as in training… Is that normal?
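For reference, in current PyTorch the inference path would look roughly like this (a minimal sketch; model and batch are placeholders):

model.eval()                        # eval behaviour for dropout/batchnorm
with torch.no_grad():               # no graph is recorded, activations can be freed
    output = model(batch.cuda())

If memory still looks the same as in training, make sure every forward pass really runs inside the no_grad block, and note that nvidia-smi reports the caching allocator’s reserved memory, which does not shrink when individual tensors are freed.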