Hi, I have a very strange error: when I call outputs = net(images) in every iteration of a for loop, the CUDA memory usage keeps increasing until the GPU runs out of memory.
The weird part is that the issue only shows up if I have this loop inside a function; if the contents of the function just sit in my normal script, it works fine. What could be the cause of this?
Thanks
This is because PyTorch builds the graph again and again, and all the intermediate states get stored.
In training, those states are cleared when you call backward.
At test time, however, you can wrap the input with Variable(xxx, volatile=True). I don’t know if that’s your case, given the weird situation you describe.
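For reference, a minimal sketch of that test-time pattern, using a stand-in nn.Linear model and random inputs (these are placeholders, not from the original post); on PyTorch 0.4 and later the volatile flag is a no-op and wrapping the loop in with torch.no_grad(): does the same job:

import torch
import torch.nn as nn
from torch.autograd import Variable

net = nn.Linear(10, 2)        # stand-in for your network
data = torch.randn(4, 10)     # stand-in for a batch of images

net.eval()
for _ in range(100):
    inputs = Variable(data, volatile=True)  # no graph is recorded for this forward pass
    outputs = net(inputs)                   # memory stays flat across iterations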
It’s hard to guess what’s happening; you’re probably holding on to the output, or to the loss, for too long. If you’re accumulating the losses over multiple batches, use loss.data[0].
The graph isn’t rebuilt in place: the 100 iterations basically create 100 graphs. Those graphs share the same input and parameters, but all the intermediate variables of the 100 graphs (even though they may hold the same values) are stored separately.
for i in xrange(100):
    out = net(input)

    loss += someLossFunction(out)   # BAD: keeps extending the graph across the for loop
    loss = someLossFunction(out)    # this is fine: a fresh graph each iteration

    loss = someLossFunction(out)
    total_loss += loss.data[0]      # this is fine: accumulates a plain number, not a Variable
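(Side note: on newer PyTorch versions loss.data[0] was replaced by loss.item(), but the principle is the same: accumulate a plain Python number, not the Variable/Tensor that carries the graph.)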
Thanks @smth, I think I get it now: since the loss is a Variable, accumulating it keeps making the graph longer and longer, connecting a whole (?) graph onto it over and over again… I think.
I’m just asking for the sake of understanding, so let me put it differently: are you saying that this statement here will make two graphs that are identical to each other?
loss = someLossFunction(input1) + someLossFunction(input2)
To clarify for RNN users who might come across this: we do actually want to keep the graph around for backprop-through-time. (Right? Or is there a better way?)
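For what it’s worth, a minimal sketch of the usual truncated-BPTT pattern, written against the current tensor API rather than the old Variable one (the nn.RNN, sizes, and loss below are made-up placeholders): the graph is kept across the time steps inside one window, and hidden.detach() cuts it at the window boundary so memory doesn’t grow without bound:

import torch
import torch.nn as nn

rnn = nn.RNN(input_size=8, hidden_size=16, batch_first=True)   # placeholder model
optimizer = torch.optim.SGD(rnn.parameters(), lr=0.01)
criterion = nn.MSELoss()

hidden = torch.zeros(1, 4, 16)              # (num_layers, batch, hidden_size)
for step in range(100):
    chunk = torch.randn(4, 5, 8)            # one window of the sequence
    target = torch.randn(4, 5, 16)

    hidden = hidden.detach()                # cut the graph at the window boundary
    output, hidden = rnn(chunk, hidden)     # graph spans only this window

    loss = criterion(output, target)
    optimizer.zero_grad()
    loss.backward()                         # buffers for this window are freed here
    optimizer.step()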
Hello, everyone.
Besides this problem, what else could increase the GPU memory (leak GPU memory)?
I have fixed my loss function code, but it still doesn’t work.
thanks,
Albert Christianto