GPU memory usage keeps growing during training

Hi guys, I am using a U-Net together with an RNN.
I found that my GPU memory usage increases after each step instead of remaining stable.
After about 40 steps, an out-of-memory error occurs.
It seems something is stuck in GPU memory and is causing a leak.
I am using an RTX 2080 Ti.
My code is here: https://github.com/vagr8/R_Unet/blob/master/R_Unet.py
Thanks!

This most likely happens because you either store things in a list that keeps growing at each iteration, or you hold onto the computational graph of the whole history.

You should be able to check the first one by printing the list sizes in your code (in particular, buffer).
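
For example, something like this (a minimal, self-contained sketch, not your actual code) shows both how to watch those numbers and why an ever-growing list leaks memory:

```python
import torch

# Hypothetical illustration: appending model outputs to a list without
# .detach() keeps every step's activations and graph alive, so allocated
# GPU memory grows at each iteration.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Linear(1024, 1024).to(device)
buffer = []

for step in range(10):
    x = torch.randn(64, 1024, device=device)
    out = model(x)
    buffer.append(out)             # leak: holds activations + graph
    # buffer.append(out.detach())  # fix: keep only the values
    mem = torch.cuda.memory_allocated() / 1e6 if device == "cuda" else 0.0
    print(f"step {step}: len(buffer)={len(buffer)}, allocated={mem:.1f} MB")
```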

For the second one, if you store things for which you only want the value, and do not need gradients to be backpropagated, make sure you use .detach().
You can use torchviz to print the computational graph associated with your loss and check that it does not grow at every iteration. Otherwise, you need to identify where it links to the previous operations and use .detach() to break it at that point.
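
A minimal sketch of the torchviz check, assuming torchviz (and the graphviz binary) are installed; the model and loss here are just placeholders:

```python
import torch
from torchviz import make_dot  # pip install torchviz

# Placeholder model/loss just so there is a graph to draw. In your code you
# would call make_dot on the real loss inside the training loop and check
# whether the rendered graph keeps getting deeper from one step to the next.
model = torch.nn.Linear(16, 1)
x = torch.randn(4, 16)
loss = model(x).sum()

make_dot(loss, params=dict(model.named_parameters())).render("loss_graph", format="png")
```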

Thank you!
To test the lists/buffer problem, I disabled my RNN unit. After that, memory usage looks right.
Now I know where the problem is.

When I use (1) a simple LSTM class that I wrote, the memory usage is abnormal.
However, using (2) the LSTM directly, without wrapping it in a class, does not cause the problem.

What makes the difference between using a wrapper class for the RNN and using the RNN directly?

Here is the code for (1) and (2):

Edit: I found that the way I provided the code was hard to read, so I put a GitHub link instead.
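
To make the question concrete, here is a simplified, hypothetical sketch of the two patterns (not the exact code from the link):

```python
import torch
import torch.nn as nn

# (1) Hypothetical wrapper class that keeps the hidden state between calls.
#     Every forward() extends the graph hanging off self.hidden, so memory
#     grows with each iteration unless the state is detached.
class MyLSTM(nn.Module):
    def __init__(self, size):
        super().__init__()
        self.lstm = nn.LSTM(size, size, batch_first=True)
        self.hidden = None  # carried over across forward() calls

    def forward(self, x):
        out, self.hidden = self.lstm(x, self.hidden)
        return out

# (2) Calling nn.LSTM directly with a fresh (default zero) hidden state each
#     time: nothing links one iteration to the previous one.
lstm = nn.LSTM(32, 32, batch_first=True)
x = torch.randn(8, 10, 32)
out, _ = lstm(x)
```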

After periodically detaching the hidden states in the RNN, the problem is solved.
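Roughly what the change looks like (a simplified sketch, not the exact R_Unet code; here the state is detached every step, and detaching every N steps works the same way):

```python
import torch
import torch.nn as nn

# Simplified sketch of the fix: detach the carried-over hidden state so the
# graph no longer links back through all the earlier steps.
lstm = nn.LSTM(32, 32, batch_first=True)
opt = torch.optim.SGD(lstm.parameters(), lr=0.01)
hidden = None

for step in range(100):
    x = torch.randn(8, 10, 32)
    out, hidden = lstm(x, hidden)
    loss = out.pow(2).mean()

    opt.zero_grad()
    loss.backward()
    opt.step()

    # Keep the values of (h, c) but cut the graph at this point.
    hidden = tuple(h.detach() for h in hidden)
```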
Thank you!
