Tracking down a suspected memory leak

Hello,

Thank you for PyTorch! Would you have a hint on how to approach ever-increasing memory use?

While playing around with the (very cool, thanks Sean!) deepspeech.pytorch, I noticed that the memory use (RAM, but not GPU memory) increases from one epoch to the next.

Being on Python 3.5, I used tracemalloc to see where memory is allocated. To my surprise, the forward pass (“out = model(inputs)”) shows up as one of the top allocators for each cycle.
Is it expected that the forward pass accumulates memory?
Is there a standard way to find out where exactly that happens? (I have pasted below the backtrace of the main allocation that comes up when comparing allocations between two epochs.)
I would appreciate any hint or pointer.

Thank you!

Thomas

P.S.: I have adapted the script to set a few defaults differently and to use a different dataset, but did not change much of the actual computation (and I’d be happy to provide my changes if they are of interest). I am using a git checkout of today’s PyTorch master from GitHub.
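P.P.S.: In case it is useful to anyone trying the same thing, the comparison boils down to something like the sketch below. This is not my exact instrumentation, and the snapshot names are illustrative.

import tracemalloc

tracemalloc.start(25)  # record up to 25 stack frames per allocation

# ... run one epoch ...
snapshot_before = tracemalloc.take_snapshot()
# ... run the next epoch ...
snapshot_after = tracemalloc.take_snapshot()

# group the differences by traceback and print the biggest growers
for stat in snapshot_after.compare_to(snapshot_before, 'traceback')[:5]:
    print(stat)
    for line in stat.traceback.format():
        print(line)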

  File "/usr/local/lib/python3.5/dist-packages/torch/backends/cudnn/__init__.py", line 387
    return array_type(*itr)
  File "/usr/local/lib/python3.5/dist-packages/torch/backends/cudnn/__init__.py", line 160
    int_array(tensor.size()), int_array(tensor.stride())))
  File "/usr/local/lib/python3.5/dist-packages/torch/backends/cudnn/__init__.py", line 398
    descriptor.set(tensor)
  File "/usr/local/lib/python3.5/dist-packages/torch/backends/cudnn/rnn.py", line 242
    fn.cy_desc = cudnn.descriptor(cx) if cx is not None else None
  File "/usr/local/lib/python3.5/dist-packages/torch/nn/_functions/rnn.py", line 269
    cudnn.rnn.forward(self, input, hx, weight, output, hy)
  File "/usr/local/lib/python3.5/dist-packages/torch/autograd/function.py", line 224
    result = self.forward_extended(*nested_tensors)
  File "/usr/local/lib/python3.5/dist-packages/torch/autograd/function.py", line 202
    flat_output = super(NestedIOFunction, self)._do_forward(*flat_input)
  File "/usr/local/lib/python3.5/dist-packages/torch/nn/_functions/rnn.py", line 327
    return func(input, *fargs, **fkwargs)
  File "/usr/local/lib/python3.5/dist-packages/torch/nn/modules/rnn.py", line 91
    output, hidden = func(input, self.all_weights, hx)
  File "/usr/local/lib/python3.5/dist-packages/torch/nn/modules/module.py", line 202
    result = self.forward(*input, **kwargs)
  File "deepspeech.pytorch/model.py", line 48
    x, _ = self.rnn(x)
  File "/usr/local/lib/python3.5/dist-packages/torch/nn/modules/module.py", line 202
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/torch/nn/modules/container.py", line 64
    input = module(input)
  File "/usr/local/lib/python3.5/dist-packages/torch/nn/modules/module.py", line 202
    result = self.forward(*input, **kwargs)
  File "deepspeech.pytorch/model.py", line 94
    x = self.rnns(x)
  File "/usr/local/lib/python3.5/dist-packages/torch/nn/modules/module.py", line 202
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/torch/nn/parallel/data_parallel.py", line 59
    return self.module(*inputs[0], **kwargs[0])
  File "/usr/local/lib/python3.5/dist-packages/torch/nn/modules/module.py", line 202
    result = self.forward(*input, **kwargs)
  File "train_de.py", line 139
    out = model(inputs)

It seems the memory consumption increases if bidirectional=True is passed to the nn.LSTM instantiation. I have not found out why, though.


I’ve seen this too (memory consumption increases if bidirectional is passed), but I can’t remember (no pun intended) how I fixed it. There is a way to get it to work though, as I’ve been using bidirectional LSTMs for the last few weeks!

Thanks. If you have gotten nn.LSTM to work, I can stop trying to stare down nn._functions.rnn.

Should you later remember what you did to make it work, I’d appreciate a hint.

Memory consumption often seems to be stable over a few tens of mini-batches, but then it continues to increase and does not bounce back after an epoch.
I have tried various things like keeping the input size greater than the hidden size (times two for bidirectional), but to no avail.

Hi @tom, if I remember correctly, I worked by adding tiny bits of code to a basic working (i.e. non-memory-exploding) program. I had tried starting with something complicated and adding a bi-directional LSTM, but that just blew up. So I did it the other way around and started with a very simple LSTM skeleton.

I hope this helps?

@tom can you try adding gc.collect() calls here and there? Maybe there’s some problem with reference cycles.

@apaszke thanks for the hint. I tried that, but it didn’t help. I’ll start fresh with a new model and see whether it goes wrong again. If it does, I’ll be closer to having a minimal test case to share.

Hi @tom @apaszke

I am training a dialog system with OpenNMT-py (https://github.com/OpenNMT/OpenNMT-py).
My dataset contains 26,265,224 sequence pairs (available here: https://github.com/jiweil/Neural-Dialogue-Generation).

I observed a similar problem: GPU memory consumption remains constant (2.9 GB), but CPU RAM is almost eaten up.
Initially the CPU memory consumption is around 27 GB (my dataset is large, so that is acceptable), but after 38.5 hours (around 3.4 epochs) of training it grows to 53 GB.

The main difference to your case is that I am using a unidirectional LSTM. Since I am training on the GPU but the memory consumption grows on the CPU, it might have something to do with the dynamic computational graph, which I believe is created on the CPU. It is likely that after each training batch the dynamic graph is not destroyed properly. I am not sure whether it is because the maximum sequence length differs from batch to batch and PyTorch has to create a new graph per batch.

Oh, here is my torch version:

>>> import torch
>>> torch.__version__
'0.1.10+ac9245a'

Hello @XingxingZhang

Similar to what you describe, my issue is with CPU RAM, too; GPU memory consumption is constant.
With the unidirectional LSTM, @apaszke’s hint works for me.
In more detail, you need to “import gc” at the top and call “gc.collect()” every epoch or so. (I don’t know whether it is always needed, but that worked for me.)
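In code, that amounts to something like this (a minimal sketch; train_one_epoch, model and data_loader are placeholders for whatever your training loop uses):

import gc

for epoch in range(10):  # or however many epochs you train for
    train_one_epoch(model, data_loader)  # placeholder for the actual training code
    gc.collect()  # break up reference cycles so their memory is freed right away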
My impression was that the OpenNMT version in https://github.com/pytorch/examples has some fixes that the main OpenNMT-py repository does not have yet.

Best regards

Thomas

Ugh, can you please get us the smallest possible snippet that still causes the memory to blow up? That’s the only way we can find and fix these things.

Hi @tom

I have added “gc.collect()” to my code and will let it run for one or two days to see if I still have the problem.

cheers,
Xingxing

If anyone is concerned about bi-directional LSTM and memory leak issues in PyTorch, maybe a good place to start coding a new project is this already-working tutorial:

https://github.com/yunjey/pytorch-tutorial/blob/master/tutorials/07%20-%20Bidirectional%20Recurrent%20Neural%20Network/main.py

Ah, thanks. I was going to post a movie sentiment net, but that has the drawback of including an embedding layer.

So here is what I did:
I took this example, added the following helper:

# helper function to get the RSS size, see proc(5) under statm. This is in pages (4 kB on my Linux)
def memory_usage():
    # the second field of /proc/self/statm is the resident set size, in pages
    return int(open('/proc/self/statm').read().split()[1])

Then I changed the number of epochs to 100, sprinkled in gc.collect() calls, and printed memory_usage() after each epoch, roughly as in the sketch below.
For me (on Python 3.5 from Debian with PyTorch 0.1.10+821656d from Friday) this added about 1 megabyte of memory per epoch (from ~1.7 GB RSS to ~1.8 GB) over the 100 epochs.
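The loop amounted to something like this (a sketch; train_one_epoch is a placeholder for the tutorial’s training code, memory_usage is the helper above, and the full script is in the gist linked below):

import gc

for epoch in range(100):
    train_one_epoch()  # placeholder for the tutorial's training loop
    gc.collect()
    print("epoch %d: rss = %d pages" % (epoch, memory_usage()))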

I saw similar effects with the sentiment analysis one, but I’m not sure I can keep saying that unidirectional LSTMs always work…

I pasted the code at https://gist.github.com/t-vi/ea01a032faa1beefc78531e6e292add5
@apaszke is that approximately what you had in mind as a test case?

Kind of, but the leaks mentioned earlier were much larger, and the larger the leak, the easier it is to find :slight_smile: This doesn’t sound too bad; it could be because of how the libc allocator works.

@apaszke Right. Thank you for your patience with me here!

So here is a small thing that hits 100 MB of extra memory consumption after 12-18 “epochs” with a bidirectional LSTM, and not without.

I don’t think it is the contiguous call plus summation, as it also happens when I double the input dimension of the 2nd-4th LSTM layers instead.
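For reference, the gist is essentially a stack of four LSTM layers run over random input for a number of “epochs”; structurally it looks roughly like the sketch below (layer arguments as in the module printout further down in the thread; the input sizes and the exact handling of the two directions here are illustrative, the gist itself is authoritative):

import gc
import torch
import torch.nn as nn
from torch.autograd import Variable


def memory_usage():
    # resident set size in pages, second field of /proc/self/statm (same helper as above)
    return int(open('/proc/self/statm').read().split()[1])


class DummyModule(nn.Module):
    def __init__(self, bidirectional=True):
        super(DummyModule, self).__init__()
        # four stacked 400 -> 400 LSTM layers, matching the printout below
        self.rnn1 = nn.LSTM(400, 400, bias=False, batch_first=True, bidirectional=bidirectional)
        self.rnn2 = nn.LSTM(400, 400, bias=False, batch_first=True, bidirectional=bidirectional)
        self.rnn3 = nn.LSTM(400, 400, bias=False, batch_first=True, bidirectional=bidirectional)
        self.rnn4 = nn.LSTM(400, 400, bias=False, batch_first=True, bidirectional=bidirectional)

    def forward(self, x):
        for rnn in (self.rnn1, self.rnn2, self.rnn3, self.rnn4):
            x, _ = rnn(x)
            if rnn.bidirectional:
                # sum the two directions so the next layer sees 400 features again
                x = x.contiguous().view(x.size(0), x.size(1), 2, 400).sum(2).squeeze(2)
        return x


model = DummyModule(bidirectional=True)
inp = Variable(torch.randn(32, 50, 400))  # batch, time, features; sizes are made up
for epoch in range(20):
    out = model(inp)
    out.sum().backward()
    gc.collect()
    print("at epoch %d: rss = %d pages" % (epoch, memory_usage()))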

Hi @tom

Unfortunately, “gc.collect()” didn’t fix the leak in my program. It takes less RAM in the 3rd epoch, but it is still around 45 GB.

I will look into it again.

I think that in your test code here, https://gist.github.com/t-vi/60515a24e1cbc4dc87897a6d8c224698,
the uni-LSTM is also leaking memory, but the leak is much smaller than for the bi-LSTM.

@apaszke, is the example clean enough to file as an issue, or do you think it would need more work?

I tried to run LSTM-bidir-memory-consumption.py (attached by @tom) on our server. The memory consumption is quite stable.

DummyModule (
(rnn1): LSTM(400, 400, bias=False, batch_first=True, bidirectional=True)
(rnn2): LSTM(400, 400, bias=False, batch_first=True, bidirectional=True)
(rnn3): LSTM(400, 400, bias=False, batch_first=True, bidirectional=True)
(rnn4): LSTM(400, 400, bias=False, batch_first=True, bidirectional=True)
)
at epoch 0: consuming extra 0.0 MB
at epoch 1: consuming extra 0.0 MB
at epoch 2: consuming extra 0.0 MB
at epoch 3: consuming extra 1.0 MB
at epoch 4: consuming extra 1.0 MB
at epoch 5: consuming extra 1.0 MB
at epoch 6: consuming extra 1.0 MB
at epoch 7: consuming extra 1.0 MB
at epoch 8: consuming extra 1.0 MB
at epoch 9: consuming extra 1.0 MB
at epoch 10: consuming extra 1.0 MB
at epoch 11: consuming extra 1.0 MB
at epoch 12: consuming extra 1.0 MB
at epoch 13: consuming extra 1.0 MB
at epoch 14: consuming extra 1.0 MB
at epoch 15: consuming extra 1.0 MB
at epoch 16: consuming extra 1.0 MB
at epoch 17: consuming extra 1.0 MB
at epoch 18: consuming extra 1.0 MB
at epoch 19: consuming extra 1.0 MB
DummyModule (
(rnn1): LSTM(400, 400, bias=False, batch_first=True)
(rnn2): LSTM(400, 400, bias=False, batch_first=True)
(rnn3): LSTM(400, 400, bias=False, batch_first=True)
(rnn4): LSTM(400, 400, bias=False, batch_first=True)
)
at epoch 0: consuming extra 0.0 MB
at epoch 1: consuming extra 0.0 MB
at epoch 2: consuming extra 0.0 MB
at epoch 3: consuming extra 0.0 MB
at epoch 4: consuming extra 1.0 MB
at epoch 5: consuming extra 1.0 MB
at epoch 6: consuming extra 1.0 MB
at epoch 7: consuming extra 1.0 MB
at epoch 8: consuming extra 1.0 MB
at epoch 9: consuming extra 1.0 MB
at epoch 10: consuming extra 1.0 MB
at epoch 11: consuming extra 1.0 MB
at epoch 12: consuming extra 1.0 MB
at epoch 13: consuming extra 1.0 MB
at epoch 14: consuming extra 1.0 MB
at epoch 15: consuming extra 1.0 MB
at epoch 16: consuming extra 1.0 MB
at epoch 17: consuming extra 1.0 MB
at epoch 18: consuming extra 1.0 MB
at epoch 19: consuming extra 1.0 MB

Thank you for testing this, @donglixp. What python / pytorch / cuda / cudnn versions did you use?

python: 2.7
pytorch: latest version (installed using pip)
cuda: 8.0.44
cudnn: 5.1_8.0

Hope this helps.