Tracking down a suspected memory leak

I have tracked this down to a problem in libcudnn6 itself using their RNN example, which @ngimel kindly confirmed.
Unfortunately, something went wrong with the Nvidia forum post (it never went public), and the problem appears to persist in the latest cudnn update I tested.

I’m uncertain whether I am allowed to distribute my minimal test case under Nvidia’s license, but I’d be happy to provide a diff to the shipped example.

Best regards

Thomas

FYI,
I had the same problem; using this flag solved my issue, and nothing else worked. That said, “cudnn.enabled = False” causes significant performance degradation (i.e. the code runs much slower). I expected some slowdown, but not that much. (A minimal sketch of the flag is shown after the version info below.)

pytorch: 0.1.12_2
cuda: 8.0 (V8.0.61)
cudnn: 5
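
For reference, the workaround is a single global switch set once at the top of the training script (a minimal sketch; the slowdown is expected, since PyTorch then falls back to its own, generally slower, kernels):

import torch

# Disable cuDNN globally; subsequent GPU ops use PyTorch's native kernels instead.
torch.backends.cudnn.enabled = False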

Thanks.

I still see this problem happening. I posted the issue here:

Note that “my” memory leak happens within cudnn, as demonstrated by adapting the cudnn C example shipped by Nvidia. It seems out of reach for PyTorch to fix, other than by not using cudnn.
As far as I understand, it is something funny in how Nvidia builds their releases.

Best regards

Thomas

I don’t think it is only cudnn, because when I tried using “torch.backends.cudnn.enabled = False”, the problem still happened.

This happens for me when using a CNN (no LSTM). See also: Memory Usage/Leak

torch.backends.cudnn.enabled = False

does solve the problem for me, but unfortunately slows things down quite a bit.
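
If the leak can be isolated to one part of the model, a possible middle ground is to disable cuDNN only around that part. This is a sketch under assumptions: it relies on the torch.backends.cudnn.flags context manager being available in your PyTorch version, and the layers, shapes, and data below are placeholders rather than code from this thread.

import torch
import torch.nn as nn
import torch.backends.cudnn as cudnn

# Hypothetical two-part model: keep cuDNN for the convolution,
# turn it off only for the recurrent part.
conv = nn.Conv1d(8, 8, kernel_size=3, padding=1).cuda()    # assumes a CUDA device
lstm = nn.LSTM(input_size=8, hidden_size=8).cuda()

x = torch.randn(4, 8, 16, device='cuda')                   # (batch, channels, length)
features = conv(x)                                          # runs with cuDNN enabled

with cudnn.flags(enabled=False):                            # cuDNN off only inside this block
    out, _ = lstm(features.permute(2, 0, 1).contiguous())   # (seq_len, batch, features)

Everything outside the with block keeps cuDNN's speed, so the slowdown is limited to the suspect layers.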

@apaszke
I was tortured by this CPU memory leak for a long time too. CPU memory keeps increasing during training; although the growth per step is small, memory is eaten up after about 30 epochs.

torch: 0.2.0.3
cuda: 7.5
OS: ubuntu 16.04LTS

I tried many workarounds (e.g. adding gc.collect(), detaching the hidden state of the LSTM, setting volatile = True during decoding; a sketch of the detach approach is shown below), but none of them worked.
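
For reference, this is roughly what “detach the hidden state” looks like, a minimal sketch using the current PyTorch API (the LSTM sizes and the random data are placeholders). Detaching keeps the autograd graph from growing across batches, although, as noted above, it did not make this particular leak go away:

import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=10, hidden_size=20).cuda()    # assumes a CUDA device
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(lstm.parameters(), lr=0.01)

hidden = None
for step in range(100):
    x = torch.randn(5, 3, 10, device='cuda')            # (seq_len, batch, input_size)
    target = torch.randn(5, 3, 20, device='cuda')

    if hidden is not None:
        # Cut the graph here so gradients do not flow back into earlier batches.
        hidden = tuple(h.detach() for h in hidden)

    output, hidden = lstm(x, hidden)
    loss = criterion(output, target)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()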

After adding the following line, the memory leak disappeared. Thanks! However, training is now much slower than before (about half of the previous speed).

torch.backends.cudnn.enabled = False

I am not familiar with cudnn and the backend. Could you please explain what the problem is when cudnn is enabled in the backend? Is there any solution that fixes the memory leak but keeps the running speed?

Thank you very much! I have learned a lot from this post.

Can you please open a GitHub issue with a minimal code example that would let us reproduce this? Thanks!

Hi @apaszke, I have opened an issue with a minimal code example. Here is the link: pytorch issue 3665

Thanks!