Increase the CUDA memory twice then stop increasing

klory · October 31, 2018, 5:10pm

I have the code below, and I don’t understand why the memory usage increases twice and then stops increasing.

I searched the forum and could not find an answer.

env: PyTorch 0.4.1, Ubuntu 16.04, Python 2.7, CUDA 8.0/9.0

from torchvision.models import vgg16
import torch
import pdb

net = vgg16().cuda()
data1 = torch.rand(16,3,224,224).cuda()

for i in range(10):
    pdb.set_trace()
    out1 = net(data1)

First stop: this is what data1 and vgg16 take.
Second stop: this is what the intermediate state of vgg16 takes.
Third stop: WHY does it increase again?
Fourth stop: WHY does it stop increasing?

colesbury · October 31, 2018, 8:58pm

The memory is from the output out1 and intermediate activations needed to compute the gradient. The first increase is from computing out1. The second increase is from computing net(data1) while out1 is still alive. The reason is that in:

out1 = net(data1)

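The right-hand side net(data1) is evaluated before the assignment, so the new output and its activations are allocated while the old out1 still exists. Memory usage, as reported by the system, generally doesn’t decrease, because PyTorch’s caching allocator keeps freed blocks for reuse instead of returning them to the driver. If it did decrease, it would drop back to 2872Mi after the assignment.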
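The evaluation order can be seen without a GPU. Below is a minimal sketch in plain Python, assuming CPython's reference counting; FakeActivation and fake_net are hypothetical stand-ins for the output tensor (plus the state it keeps alive) and for net(data1), not PyTorch objects.

```python
class FakeActivation:
    """Hypothetical stand-in for out1 and the autograd state it keeps alive."""
    live = 0  # number of instances currently alive
    peak = 0  # highest value `live` has reached since the last reset

    def __init__(self):
        FakeActivation.live += 1
        FakeActivation.peak = max(FakeActivation.peak, FakeActivation.live)

    def __del__(self):
        FakeActivation.live -= 1


def fake_net():
    # Stand-in for net(data1): every call allocates a fresh result.
    return FakeActivation()


def run_loop(steps):
    # Start counting the peak from whatever is alive right now.
    FakeActivation.peak = FakeActivation.live
    for _ in range(steps):
        # The right-hand side is built first, while the previous out1 is
        # still bound; only afterwards is the old object released.
        out1 = fake_net()
    return FakeActivation.peak


print(run_loop(1))   # one object alive at most
print(run_loop(10))  # two objects alive at most, however many steps run
```

The peak reaches two on the second iteration and never goes higher, which matches the "increases twice, then stops" pattern in the question.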

You can rewrite your program to avoid keeping two versions of out1 alive at once:

def eval(network, input):
  out1 = network(input)
  # maybe use out1 here

for i in range(10):
  eval(net, data1)

As long as you don’t return out1 from eval, out1 will be freed before the next call, so you’ll only use 2872Mi.
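To illustrate why the function version helps, here is a small sketch using weak references, again in plain Python and assuming CPython's reference counting. FakeOutput, evaluate, and evaluate_and_return are hypothetical names used only for this illustration.

```python
import weakref


class FakeOutput:
    """Hypothetical stand-in for out1 and the state it holds."""


def evaluate(registry):
    out1 = FakeOutput()
    # A weak reference lets us observe out1 without keeping it alive.
    registry.append(weakref.ref(out1))
    # out1 is a local variable, so it is released when the function returns.


def evaluate_and_return(registry):
    out1 = FakeOutput()
    registry.append(weakref.ref(out1))
    return out1  # the caller now holds a reference, so out1 survives


refs = []
alive_between_calls = []
for _ in range(3):
    evaluate(refs)
    # Count how many outputs are still alive before the next call starts.
    alive_between_calls.append(sum(r() is not None for r in refs))

returned_refs = []
kept = evaluate_and_return(returned_refs)
kept_alive = returned_refs[0]() is not None

print(alive_between_calls)  # nothing survives between calls
print(kept_alive)           # the returned object is still alive
```

Nothing from one call to evaluate is alive when the next call starts, whereas the returned object stays alive for as long as the caller keeps it.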

klory · October 31, 2018, 10:41pm

But why didn’t it increase a third time?

Is there some optimization in the compiler of PyTorch, e.g., doing the first backprop and the second forward pass at the same time? This is the only reason I can think of for there being two copies.

colesbury · November 4, 2018, 5:34pm

The old value in out1 gets deleted since there are no longer any references to it. It holds onto all the internal state needed to compute the gradient, so when it gets deleted that internal state gets deleted too.

farazk86 · June 12, 2019, 7:44pm

Hi,

I changed my code as suggested, but my GPU memory still doubles during evaluation. Wrapping my evaluation code in with torch.no_grad() helped the code run, as I am no longer getting CUDA out of memory errors, but my GPU memory still doubles.

Before reading your post my code was structured like:

for i in range(50):
    for idx, (inputs, targets) in enumerate(loaders):
        inputs, targets = inputs.to(device), targets.to(device)
        optimizer.zero_grad()

        outputs = model(inputs)

        loss = loss_criterion(outputs, targets)

        predicted_output = torch.argmax(outputs.detach(), 0).squeeze(0)


        error_ = error(predicted_output.data.cpu(), targets.data.cpu().long().squeeze())

        running_loss.update(loss.item(), inputs.size(0))
        running_error.update(error_.item(), inputs.size(0))

But after your suggestion I changed it to:

def forward_pass(model_, input_, target_, loss_):
    outputs = model_(input_)

    loss = loss_(outputs, target_)

    predicted_output = torch.argmax(outputs.detach(), 0).squeeze(0)

    return loss, predicted_output

for i in range(50):
    for idx, (inputs, targets) in enumerate(loaders):
        inputs, targets = inputs.to(device), targets.to(device)
        optimizer.zero_grad()

        loss, predicted_output = forward_pass(model, inputs, targets, loss_criterion)

        error_ = error(predicted_output.data.cpu(), targets.data.cpu().long().squeeze())

        running_loss.update(loss.item(), inputs.size(0))
        running_error.update(error_.item(), inputs.size(0))

This made no difference to memory usage.

Thanks

colesbury · June 12, 2019, 9:55pm

That’s because you still assign inputs, targets, loss, etc. in the for loop and not in the function.

farazk86 · June 13, 2019, 1:33pm

Sorry if I misunderstood. Based on what you mentioned above:

colesbury:

You can rewrite your program to avoid keeping two versions of out1 alive at once:
def eval(network, input):
  out1 = network(input)
  # maybe use out1 here

for i in range(10):
  eval(net, data1)
As long as you don’t return out1 from eval, out1 will be freed before the next call, so you’ll only use 2872Mi.

I moved the forward pass out of the for loop and, as suggested, used the output inside that function: I calculated the loss and the prediction there and returned only the loss and the predicted tensor. The output itself was not returned from the function.

I would eventually have to assign the input in a for loop, as that is how I get data from my DataLoader in order to pass it to the evaluation function. I don’t see how I would get this to work without returning loss for loss.backward() and without using a for loop for the data loading.

colesbury · June 13, 2019, 7:32pm

That’s because you’re still returning “loss”, which, unless you are under torch.no_grad(), holds onto a lot of data so that you can call loss.backward() in the future.

Put everything in a function (including the GPU copies of input and targets) or use del.

...

def step(inputs, targets):
    # The GPU copies, loss and predictions are all local to this function,
    # so they are released when step() returns.
    inputs, targets = inputs.to(device), targets.to(device)
    optimizer.zero_grad()

    loss, predicted_output = forward_pass(model, inputs, targets, loss_criterion)

    error_ = error(predicted_output.data.cpu(), targets.data.cpu().long().squeeze())

    # .item() returns a plain Python number rather than a tensor.
    running_loss.update(loss.item(), inputs.size(0))
    running_error.update(error_.item(), inputs.size(0))

for i in range(50):
    for idx, (inputs, targets) in enumerate(loaders):
        step(inputs, targets)
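The del alternative mentioned above can be sketched the same way. This is a plain-Python illustration that assumes CPython's reference counting; FakeLoss, fake_forward, and peak_live are hypothetical stand-ins, with FakeLoss playing the role of a loss tensor that keeps its graph alive and item() mimicking the call that extracts a plain number.

```python
class FakeLoss:
    """Hypothetical stand-in for a loss tensor that keeps its graph alive."""
    live = 0  # number of instances currently alive
    peak = 0  # highest value `live` has reached since the last reset

    def __init__(self, value):
        self.value = value
        FakeLoss.live += 1
        FakeLoss.peak = max(FakeLoss.peak, FakeLoss.live)

    def __del__(self):
        FakeLoss.live -= 1

    def item(self):
        # Returns a plain float that holds no reference back to this object.
        return self.value


def fake_forward(step):
    # Stand-in for the forward pass plus loss computation.
    return FakeLoss(float(step))


def peak_live(steps, free_each_step):
    FakeLoss.peak = FakeLoss.live
    total = 0.0
    for step in range(steps):
        loss = fake_forward(step)
        total += loss.item()  # keep only the number
        if free_each_step:
            del loss  # drop the reference before the next forward pass
    return FakeLoss.peak


peak_without_del = peak_live(10, free_each_step=False)
peak_with_del = peak_live(10, free_each_step=True)
print(peak_without_del, peak_with_del)
```

Without del, the previous loss is still bound while the next one is being built, so two are alive at the peak; with del, only one ever is. In a real training loop the same del would go after loss.backward() and the optimizer step.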

klory · June 14, 2019, 2:03pm

Thank you colesbury, that helps.