GPU memory usage is growing after training

Hi guys, I am using a U-Net and an RNN.
I found that my GPU memory usage increases after each step instead of remaining stable.
After about 40 steps, an out-of-memory error occurs.
It seems something is stuck in GPU memory and causes a leak.
I am using an RTX 2080 Ti.
My code is here:

This most likely happens because you either store things in a list that keeps growing at each iteration, or you hold onto the computational graph of the whole history.

You should be able to check the first one by printing the size of the lists in your code (in particular, buffer).
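
A minimal sketch of that check, assuming a list called buffer like the one mentioned above (the report helper and the dummy tensors are hypothetical, only there to make the snippet runnable). It prints the list length next to the memory PyTorch currently has allocated on the GPU, so you can see whether the two grow together:

```python
import torch

def report(step, buffer):
    # memory_allocated() returns the bytes currently occupied by tensors
    # on the GPU; fall back to 0 when no GPU is present.
    allocated = torch.cuda.memory_allocated() if torch.cuda.is_available() else 0
    return (f"step {step}: len(buffer)={len(buffer)}, "
            f"allocated={allocated / 1024**2:.1f} MiB")

device = "cuda" if torch.cuda.is_available() else "cpu"
buffer = []  # hypothetical stand-in for the list in your training loop

for step in range(3):
    # Stand-in for whatever the real loop appends at each iteration.
    buffer.append(torch.zeros(10, device=device))
    print(report(step, buffer))
```

If the length keeps climbing from step to step, that list is a likely suspect.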

For the second one, make sure that whenever you store something for which you only need the value, and not gradients to be backpropagated, you call .detach() on it.
You can use torchviz to print the computational graph associated with your loss and make sure it does not grow at every iteration. If it does grow, you need to identify where it links to the operations of previous iterations, and use .detach() to break it at that point.
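
To illustrate what .detach() changes, here is a small sketch (the tensor w and both list names are made up for the example). A stored loss that is still attached carries a grad_fn, which keeps its graph alive for as long as the list holds it; the detached copy keeps only the value:

```python
import torch

w = torch.ones(3, requires_grad=True)  # hypothetical parameter

history_attached = []  # each entry keeps its computational graph alive
history_detached = []  # each entry holds the value only

for step in range(3):
    loss = (w * step).sum()
    history_attached.append(loss)
    history_detached.append(loss.detach())

for attached, detached in zip(history_attached, history_detached):
    # grad_fn is set for tensors that are still part of a graph,
    # and None for detached ones.
    print(attached.grad_fn is not None, detached.grad_fn is None)
```

If you only need a Python number (for logging, say), loss.item() has the same effect of not keeping the graph.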

Thank you!
To test the list/buffer problem, I disabled my RNN unit. After that, memory usage looked right.
Now I know where the problem is.

I wrote (1) a simple LSTM class, and its memory usage is unusual.
However, (2) using LSTM directly, without wrapping it in a class, does not cause the problem.

What is the difference between using a sequential class for the RNN and using the RNN directly?

Here is the code for (1) and (2):

Edit: I found the way I provided the code hard to read, so I put a GitHub link instead.

After periodically detaching the hidden state in the RNN, the problem is solved.
Thank you!
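
For anyone landing here later, a rough sketch of what detaching the hidden state can look like. This is not the code from the linked repository; the layer sizes, the random data, and the names lstm, head and opt are all assumptions for illustration, and it detaches at every step rather than periodically:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical tiny model: an LSTM followed by a linear head.
lstm = nn.LSTM(input_size=4, hidden_size=8, batch_first=True)
head = nn.Linear(8, 1)
opt = torch.optim.SGD(list(lstm.parameters()) + list(head.parameters()), lr=0.01)

hidden = None  # (h, c) carried over from one training step to the next

for step in range(5):
    x = torch.randn(2, 3, 4)  # (batch, sequence, features), random stand-in data
    y = torch.randn(2, 3, 1)

    out, hidden = lstm(x, hidden)
    loss = nn.functional.mse_loss(head(out), y)

    opt.zero_grad()
    loss.backward()
    opt.step()

    # Keep the values of h and c but cut the link to this step's graph,
    # so the next step does not backpropagate into (or hold onto) the history.
    hidden = tuple(h.detach() for h in hidden)
```

Without the last line, the carried hidden state still points at the graph of the previous steps, which matches the growing-graph situation described earlier in the thread.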
