RuntimeError: CUDA out of memory during loss.backward()

I am encountering the following error during my training run:

RuntimeError: CUDA out of memory. Tried to allocate 98.00 MiB (GPU 0; 11.93 GiB total capacity; 11.32 GiB already allocated; 81.06 MiB free; 72.23 MiB cached)

I have tried the following approaches to solve the issue, all to no avail:

  1. reducing the batch size, all the way down to 1

  2. moving everything to the CPU, leaving only the network on the GPU

  3. removing the validation code and running only the training code

  4. reducing the size of the network (I reduced it significantly; details below)

  5. scaling the loss that is backpropagated down to a much smaller value

None of the above has worked. My code crashes after just a few batches in the very first epoch. Depending on the batch size, it either survives a few more batches (with a smaller batch size) or crashes sooner (with a larger batch size, say 16).

Typically it crashes after around 5 to 8 batches of the first epoch.

The following is the entire traceback:

sys:1: RuntimeWarning: Traceback of forward call that caused the error:
  File "main.py", line 411, in <module>
    intensity.cnn_mse(au_net, config)
  File "/work/satrajic/FacialExpression/Experiments/intensity.py", line 1431, in cnn_mse
    corr_factor = corr_net(inputs).reshape([BATCH_SIZE, 2, 5]).cpu()
  File "/home/satrajic/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/work/satrajic/FacialExpression/Experiments/base_model.py", line 701, in forward
    x = self.features(x)
  File "/home/satrajic/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/satrajic/anaconda3/lib/python3.7/site-packages/torch/nn/modules/container.py", line 92, in forward
    input = module(input)
  File "/home/satrajic/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/satrajic/anaconda3/lib/python3.7/site-packages/torch/nn/modules/pooling.py", line 146, in forward
    self.return_indices)
  File "/home/satrajic/anaconda3/lib/python3.7/site-packages/torch/_jit_internal.py", line 133, in fn
    return if_false(*args, **kwargs)
  File "/home/satrajic/anaconda3/lib/python3.7/site-packages/torch/nn/functional.py", line 494, in _max_pool2d
    input, kernel_size, stride, padding, dilation, ceil_mode)

Traceback (most recent call last):
  File "main.py", line 411, in <module>
    intensity.cnn_mse(au_net, config)
  File "/work/satrajic/FacialExpression/Experiments/intensity.py", line 1474, in cnn_mse
    loss.backward()
  File "/home/satrajic/anaconda3/lib/python3.7/site-packages/torch/tensor.py", line 107, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/satrajic/anaconda3/lib/python3.7/site-packages/torch/autograd/__init__.py", line 93, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: CUDA out of memory. Tried to allocate 98.00 MiB (GPU 0; 11.93 GiB total capacity; 11.32 GiB already allocated; 80.06 MiB free; 72.23 MiB cached)

It seems like the issue is happening during the backward pass, while it is trying to store the gradients. Something seems to be ballooning in memory, but I cannot figure out what exactly is causing it, or how I should fix it.

Something of note is that the loss values are very, very large. I tried manually scaling them down by orders of magnitude, but that did not work either.

I would really appreciate any help on this, as I am completely perplexed.

Hi,

Could you show the part of the code that does the training? In particular, how do you call .backward(), and how do you do the forward pass in your model?

Sure!

The following is the training code where I get the outputs, compute the losses, and call .backward():

train_data = utils.IntensityLoader(data_path=TRAIN_DATA_PATH, batch_size=BATCH_SIZE,
                                   index=i).loaded_data()
inputs, truth = train_data[0][0], train_data[0][1]
inputs = inputs.to(device)
truth = truth.float()
optimizer.zero_grad()
outputs1, outputs2, outputs3, outputs4, outputs5 = au_net(inputs)

# I'm putting the outputs on the CPU to save GPU memory
outputs1 = outputs1.cpu()
outputs2 = outputs2.cpu()
outputs3 = outputs3.cpu()
outputs4 = outputs4.cpu()
outputs5 = outputs5.cpu()

corr_factor = corr_net(inputs).reshape([BATCH_SIZE, 2, 5]).cpu()
corr_factor = torch.bmm(corr_factor, corr_factor.permute(0, 2, 1))
del_corr = torch.bmm(del_cor, del_cor.permute(0, 2, 1))

# loss_2 is the CorrNet loss - this is a separate branch network with a custom loss function
loss_2_1 = torch.Tensor(2, 2).fill_(-1.0) * (corr_factor + del_corr) * torch.bmm(
    (outputs1 - train_AU_mean.float()).unsqueeze(1),
    (outputs1 - train_AU_mean.float()).unsqueeze(1).permute(0, 2, 1))

loss_2_2 = torch.Tensor(2, 2).fill_(-1.0) * (corr_factor + del_corr) * torch.bmm(
    (outputs2 - train_AU_mean.float()).unsqueeze(1),
    (outputs2 - train_AU_mean.float()).unsqueeze(1).permute(0, 2, 1))

loss_2_3 = torch.Tensor(2, 2).fill_(-1.0) * (corr_factor + del_corr) * torch.bmm(
    (outputs3 - train_AU_mean.float()).unsqueeze(1),
    (outputs3 - train_AU_mean.float()).unsqueeze(1).permute(0, 2, 1))

loss_2_4 = torch.Tensor(2, 2).fill_(-1.0) * (corr_factor + del_corr) * torch.bmm(
    (outputs4 - train_AU_mean.float()).unsqueeze(1),
    (outputs4 - train_AU_mean.float()).unsqueeze(1).permute(0, 2, 1))

loss_2_5 = torch.Tensor(2, 2).fill_(-1.0) * (corr_factor + del_corr) * torch.bmm(
    (outputs5 - train_AU_mean.float()).unsqueeze(1),
    (outputs5 - train_AU_mean.float()).unsqueeze(1).permute(0, 2, 1))

loss_2 = (loss_2_1 + loss_2_2 + loss_2_3 + loss_2_4 + loss_2_5) / 5.0

corr_loss = torch.mean(loss_2[torch.triu(torch.ones(BATCH_SIZE, 2, 2)) == 1])

print("corr loss", corr_loss)
loss_list = list()
loss = torch.zeros(1)

# calculating the losses for every head
for index in range(len(truth)):
    sub_loss_1 = AU_criterion(outputs1[index].unsqueeze(0), truth[index][0].unsqueeze(0))
    sub_loss_2 = AU_criterion(outputs2[index].unsqueeze(0), truth[index][1].unsqueeze(0))
    sub_loss_3 = AU_criterion(outputs3[index].unsqueeze(0), truth[index][2].unsqueeze(0))
    sub_loss_4 = AU_criterion(outputs4[index].unsqueeze(0), truth[index][3].unsqueeze(0))
    sub_loss_5 = AU_criterion(outputs5[index].unsqueeze(0), truth[index][4].unsqueeze(0))
    loss += sub_loss_1 + sub_loss_2 + sub_loss_3 + sub_loss_4 + sub_loss_5

loss = loss + 4.0 + loss_weight * corr_loss
# loss = loss * 10e-9  # attempted to scale the loss to a smaller value; that did not work
loss.backward()
print("total loss", loss)
optimizer.step()
running_loss += loss.item()

The following is how I define the optimizer:

params = list(au_net.parameters()) + list(corr_net.parameters())
optimizer = optim.Adam(params, lr=LEARNING_RATE, betas=(0.9, 0.999), eps=0.1)

au_net is a VGG setup with 5 FC heads for 5 outputs. Each head maps to 4096x1. I later reduced that to 24x1 to shrink the network, but that did not work either.

corr_net is a CNN with an FC layer at the end.

I am using MSE to compute the loss for au_net; that is defined as AU_criterion in the code.
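
For reference, AU_criterion is just PyTorch's built-in MSE loss, along the lines of:

import torch.nn as nn

# MSE criterion for the AU heads (the default reduction is shown here as an example)
AU_criterion = nn.MSELoss()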

Hi,

You seem to create a lot of intermediate results, but it is indeed weird that it uses 12GB.
I can't see anything obviously wrong with the code.
I would try removing some of the computations and see how it changes things. You should also print the sizes of a few tensors to make sure they are what you expect.
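
For example, a few print statements dropped right after the loss computation in your snippet (these reuse your variable names, so they are not standalone) would already tell you a lot:

# Make sure each of these has exactly the shape you expect
print("corr_factor:", corr_factor.shape)
print("del_corr:   ", del_corr.shape)
print("outputs1:   ", outputs1.shape)
print("loss_2:     ", loss_2.shape)
print("corr_loss:  ", corr_loss.shape)
print("loss:       ", loss.shape)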

I don’t think the computations in the training code should affect the backward pass because I made sure to move everything onto the CPU. By the way, I also tried using torch.cuda.empty_cache() to clear the cache after every batch. It still didn’t help.

The rest of the tensor sizes are fixed to known dimensions. The only thing standing out to me as weird is that the loss values themselves are very, very large. Could that somehow be causing the issue? Do you have any suggestions on how to fix this, or a workaround to avoid it?

The intermediate activations in the model will still be on the GPU, as you’ve just moved the outputs to the CPU. All operations afterwards will be executed on the CPU. However, once your backward call reaches the model, it will transfer back to the GPU.
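
You can see this with a small standalone example (a toy model with made-up sizes, just to illustrate where the memory lives): moving the output to the CPU does not free the activations that autograd saved on the GPU.

import torch
import torch.nn as nn

device = torch.device("cuda")

# Toy model and input, just to illustrate where the memory lives
model = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 10)).to(device)
x = torch.randn(256, 1024, device=device)

out = model(x)        # activations needed for backward are stored on the GPU
out_cpu = out.cpu()   # only the output is copied; the saved activations stay on the GPU
print(torch.cuda.memory_allocated() / 1024**2, "MiB still allocated on the GPU")

loss = out_cpu.sum()
loss.backward()       # autograd moves back to the GPU for the model's part of the graph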

The very large values are not causing the memory problems themselves, but they might be a symptom of another issue.
I have run into this when computing losses: if you have a tensor of size batch_size and another of size batch_size x 1, then because of broadcasting semantics, summing or multiplying them element-wise gives you a batch_size x batch_size tensor. The wrong size can easily be hidden when you average the loss with something like corr_loss = torch.mean(some_tensor). This kind of silent error would explain both the surprisingly large loss and the unexpected memory usage.
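
Here is a minimal, self-contained example of that trap (made-up shapes, not your actual tensors):

import torch

batch_size = 16
pred = torch.randn(batch_size)         # shape: [16]
target = torch.randn(batch_size, 1)    # shape: [16, 1]

diff = pred - target                   # silently broadcasts to [16, 16]
print(diff.shape)                      # torch.Size([16, 16])

# Reducing to a scalar hides the wrong intermediate shape completely,
# while autograd now keeps a batch_size x batch_size tensor around for backward.
loss = (diff ** 2).mean()
print(loss.shape)                      # torch.Size([])

The same thing can happen inside nn.MSELoss if the prediction is [batch_size, 1] and the target is [batch_size] (or vice versa), so it is worth checking the shapes on both sides of AU_criterion as well.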

Thanks a lot!

This is very helpful. Just to put it out there though, I solved the problem by going with a different architecture.

After decreasing the network complexity, changing from Adam to the AdamW optimizer, and reducing the dataset, I finally solved this issue by decreasing batch_size from 32 to 16. Thank you!
