Hi, I have a very strange error: when I call outputs = net(images) in every iteration of a for loop, the CUDA memory usage keeps increasing until the GPU runs out of memory.
The weird part is that the issue only shows up if I have this loop inside a function; if the contents of the function just sit in my normal script, it works fine. What could be the cause of this?
Thanks
This is because PyTorch builds the graph again and again, and all the intermediate states get stored.
In training, those states are cleared when you call backward.
At test time, however, you can wrap the input with Variable(xxx, volatile=True). I don’t know if that’s your case, given the weird situation you describe.
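For reference, a minimal sketch of that test-time pattern, using a stand-in nn.Linear model and random inputs (these are placeholders, not from the original post); on PyTorch 0.4 and later the volatile flag is a no-op and wrapping the loop in with torch.no_grad(): does the same job:

import torch
import torch.nn as nn
from torch.autograd import Variable

net = nn.Linear(10, 2)        # stand-in for your network
data = torch.randn(4, 10)     # stand-in for a batch of images

net.eval()
for _ in range(100):
    inputs = Variable(data, volatile=True)  # no graph is recorded for this forward pass
    outputs = net(inputs)                   # memory stays flat across iterations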
It’s hard to guess what’s happening; you’re probably holding on to the output, or to the loss, for too long. If you’re accumulating the losses over multiple batches, use loss.data[0].
The graph isn’t rebuilt in place: the 100 iterations basically create 100 graphs. Those graphs share the same input and parameters, but all the intermediate variables of the 100 graphs (even though they may hold the same values) are stored separately.
for i in xrange(100):
    out = net(input)

    loss += someLossFunction(out)   # BAD: keeps extending the graph across the for loop
    loss = someLossFunction(out)    # this is fine: a fresh graph each iteration

    loss = someLossFunction(out)
    total_loss += loss.data[0]      # this is fine: accumulates a plain number, not a Variable
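(Side note: on newer PyTorch versions loss.data[0] was replaced by loss.item(), but the principle is the same: accumulate a plain Python number, not the Variable/Tensor that carries the graph.)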
Thanks @smth, I think I get it now: since the loss is a Variable, accumulating it keeps making the graph longer and longer, connecting a whole (?) graph onto it over and over again… I think.
I’m just asking for the sake of understanding, so let me put it differently: are you saying that this statement here will make two graphs that are identical to each other?
loss = someLossFunction(input1) + someLossFunction(input2)
To clarify for RNN users who might come across this: we do actually want to keep the graph around for backprop-through-time. (Right? Or is there a better way?)
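For what it’s worth, a minimal sketch of the usual truncated-BPTT pattern, written against the current tensor API rather than the old Variable one (the nn.RNN, sizes, and loss below are made-up placeholders): the graph is kept across the time steps inside one window, and hidden.detach() cuts it at the window boundary so memory doesn’t grow without bound:

import torch
import torch.nn as nn

rnn = nn.RNN(input_size=8, hidden_size=16, batch_first=True)   # placeholder model
optimizer = torch.optim.SGD(rnn.parameters(), lr=0.01)
criterion = nn.MSELoss()

hidden = torch.zeros(1, 4, 16)              # (num_layers, batch, hidden_size)
for step in range(100):
    chunk = torch.randn(4, 5, 8)            # one window of the sequence
    target = torch.randn(4, 5, 16)

    hidden = hidden.detach()                # cut the graph at the window boundary
    output, hidden = rnn(chunk, hidden)     # graph spans only this window

    loss = criterion(output, target)
    optimizer.zero_grad()
    loss.backward()                         # buffers for this window are freed here
    optimizer.step()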
Hello, everyone.
Besides this problem, what else could increase the GPU memory (leak GPU memory)?
I have fixed my loss function code, but it still doesn’t work.
thanks,
Albert Christianto