Tracking down a suspected memory leak

Could be a 2.7 vs. 3.5 thing then. I’m using cuda 8 and cudnn 5.1 as well, and a two-week-old git pytorch. Would you have the opportunity to check with python 3.x as well?

I get the increasing memory usage with python 2.7 as well.

100MB extra after 15 epochs sounds reasonable; it can be caused by memory fragmentation. A 45GB increase sounds bad.

So what would you expect in terms of scaling behaviour?

Those were 15 toy epochs in the test case. With the model I’m actually trying to run I have larger epochs (but a similar batch size) and get a 2GB increase after 30 epochs, with no sign that it stabilizes. When I run it overnight, it gets OOM-killed.

But as it’s working just fine for most people, I’ll try to dig a bit more into what is going on.
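A minimal way to watch this kind of per-epoch growth (not the instrumentation actually used here; it is Linux-only since it reads /proc, and train_one_epoch is a hypothetical stand-in for the real training loop):

def rss_mb():
    # current resident set size of this process, read from /proc/self/status (Linux)
    with open('/proc/self/status') as f:
        for line in f:
            if line.startswith('VmRSS:'):
                return int(line.split()[1]) / 1024.0  # value is reported in kB

for epoch in range(30):
    train_one_epoch()  # hypothetical: one pass over the data
    print('epoch %d: RSS %.1f MB' % (epoch, rss_mb()))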

@tom I suspect it’s caused by conda. Did you try to install pytorch using pip?

@donglixp Thanks. I don’t use conda. Apparently the behaviour is better with the torch-provided pip-installed package than when I compile from source. I have not found out why yet.

I’ve tried playing around with mallopt(3) and alternative allocators (jemalloc, tcmalloc), but that actually made things worse.


I’ll try to run your snippet and see if I can reproduce it.

To give an update here: when running the same python 3.5 and pytorch under valgrind with PYTHONMALLOC=malloc and --leak-check=full, as described in the python source tree’s README.valgrind, the memory use increases by only 0.3 MB per epoch, which I would consider good enough. I’m not quite sure what to make of that, except that apparently pytorch triggers some unfortunate interaction between the python memory management and the standard malloc… :frowning:
If valgrind did not slow things down too much to be used in production, I’d just always call python that way…

I too am experiencing a pretty bad leak (20GB after about 100 epochs of CIFAR10) with pytorch built from source, including the current autograd branch. The pip-installed version doesn’t have this leak. I’m just using a standard VGG16 with dropout.

I tried tracking it using pympler, following this. At every epoch the only diffs I get are some ints/strs/lists of relatively small size. Might the leaks originate from pytorch’s C modules? It looks like I’m getting the same leak with the lua version of the same network.
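The pympler check was along these lines (a minimal sketch, not the exact code; num_epochs and train_one_epoch are hypothetical stand-ins):

from pympler import tracker

tr = tracker.SummaryTracker()
for epoch in range(num_epochs):
    train_one_epoch()  # hypothetical: one pass over the data
    tr.print_diff()    # prints the Python objects created/destroyed since the last call

If the diffs stay tiny while the process RSS keeps growing, the leak is most likely in native (C/CUDA) allocations rather than on the Python heap, which matches what is described above.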

It would be nice if someone could provide an overview of how memory allocation is done in pytorch and the TH* modules so that we can help track it down.

Hello,

apologies for bringing this up over and over again. Have you changed how you build things?

I’m asking because I started some A/B testing by setting up a more controlled build environment yesterday.
During the course of that testing, I upgraded the (previously non-memory-leaky) pip-installed version, and it seems that yesterday’s pip version (torch-0.1.11.post5-cp35-cp35m-linux_x86_64.whl, and the 3.6 wheel as well) does show the memory-leaking behaviour.

I will try to disable some of the memory-allocation features that are compiled in by default, but I have not gotten that far yet.

Best regards

Thomas

Hi,

after a night of trying to interpolate between the leaky and the non-leaky version: it appears to boil down to which libcudnn.so version is loaded by torch/backends/cudnn/__init__.py.
I switched back to the 0.1.10 mechanism of finding the library.
If it finds the 5.1.5 libcudnn, I see < 1MB per epoch; for 6.0.5 or 6.0.20 it is more than 5MB per epoch on average.
(In all cases, the extensions were linked against cudnn 6.0.5.)

Best regards

Thomas

@tom does it still leak memory if you add this:

torch.backends.cudnn.enabled = False

@apaszke
No. So it seems that the backends.cudnn module leaks memory if used with cudnn 6 (as opposed to 5) on my machine.
What I would want to find out is whether that is because of cudnn 6 itself or because of how it is used.

I have hacked up a wrapper intercepting torch.backends.cudnn.lib calls; the call statistics are listed below.
I would probably attempt to put together something that does these calls in isolation (but I know nothing about cuda programming, so I’m not sure how soon that might happen).
Out of curiosity, has reproducing the problem become easier now that torch.backends.cudnn does not prefer a system cudnn5 to the compiled-in one? (One of the - to me - unexpected aspects of the old 0.1.10 search code was that it looked by path first and by version second, so the main preference was the path of the .so - so everyone who had cudnn5 system-installed and did not upgrade to cudnn6 essentially used cudnn5 for torch.backends.cudnn.)

Best regards

Thomas

  count   function                        number of args
      1 cudnnCreate 1
      4 cudnnCreateDropoutDescriptor 1
 264000 cudnnCreateFilterDescriptor 1
   8000 cudnnCreateRNNDescriptor 1
3232000 cudnnCreateTensorDescriptor 1
      1 cudnnDestroy 1
      4 cudnnDestroyDropoutDescriptor 1
 264000 cudnnDestroyFilterDescriptor 1
   8000 cudnnDestroyRNNDescriptor 1
3232000 cudnnDestroyTensorDescriptor 1
 256000 cudnnGetFilterNdDescriptor 6
 128000 cudnnGetRNNLinLayerBiasParams 9
 128000 cudnnGetRNNLinLayerMatrixParams 9
   8000 cudnnGetRNNParamsSize 5
   8000 cudnnGetRNNTrainingReserveSize 5
   8000 cudnnGetRNNWorkspaceSize 5
      1 cudnnGetVersion 0
   8000 cudnnRNNForwardTraining 21
      4 cudnnSetDropoutDescriptor 6
   8000 cudnnSetFilterNdDescriptor 5
   8000 cudnnSetRNNDescriptor 8
3232000 cudnnSetTensorNdDescriptor 5
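A counting proxy of roughly this shape is enough to collect statistics like the above (a minimal sketch, not the actual hacked-up wrapper; the only assumption is that torch.backends.cudnn keeps its ctypes handle in a module-level lib attribute, as mentioned above):

import collections
import torch.backends.cudnn as cudnn

call_counts = collections.Counter()

class CountingLib(object):
    """Proxy around the ctypes cudnn handle that counts every call made through it."""
    def __init__(self, real_lib):
        self._real_lib = real_lib

    def __getattr__(self, name):
        attr = getattr(self._real_lib, name)
        if not callable(attr):
            return attr
        def counted(*args):
            call_counts[(name, len(args))] += 1
            return attr(*args)
        return counted

cudnn.lib = CountingLib(cudnn.lib)  # patch in the proxy after the library has been loaded
# ... run the training script, then print call_counts to get a table like the one above ...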

Tom,

If you want to see which cudnn calls are made, instead of a hacked-up wrapper you can use ltrace:
ltrace -x 'cudnn*' -l libcudnn* -f python my_script.py

(-f is there to track calls from child processes, as pytorch spawns them for the backward pass.) We’ll try to repro and take a look. Does the code pasted in your gist reproduce the problem?

Hello @ngimel,

thank you!

I have tried moving to the standalone cudnn RNN example, and it seems I manage to use cudnn in the same suboptimal way there:

https://devtalk.nvidia.com/default/topic/1002475/gpu-accelerated-libraries/cudnn6-example-with-without-bidirectional-lstm-and-memory-use/

Best regards

Thomas

P.S. I’m not entirely sure I understand the ltrace output. Maybe it is not quite compatible with ctypes.

I don’t see your post on devtalk. ltrace is not listing the arguments correctly because you don’t have a .ltrace.conf, but it does list all the cudnn calls that are made.
I can repro the memory leak with a bidirectional LSTM and cudnn 6.0.20 (in my test it leaked ~200 MB over 20 epochs); with a unidirectional LSTM the leak is much smaller, on the order of 20 MB over 20 epochs.
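A repro of roughly this shape is enough to watch the host RSS grow when the cudnn 6 RNN path is used (a minimal sketch, not the actual test; the sizes and iteration counts are made up):

import torch
from torch.autograd import Variable

# a bidirectional LSTM on the GPU exercises the cuDNN RNN descriptor path
rnn = torch.nn.LSTM(input_size=128, hidden_size=256, num_layers=2,
                    bidirectional=True).cuda()

for epoch in range(20):
    for step in range(500):
        x = Variable(torch.randn(50, 32, 128).cuda())  # (seq_len, batch, features)
        out, _ = rnn(x)
        rnn.zero_grad()
        out.sum().backward()  # forward + backward through the cuDNN RNN
    # check the process RSS (e.g. VmRSS in /proc/self/status) once per epoch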

Hello @ngimel,

Ah, cool that you can reproduce it. Thank you for putting in the effort!

The post at devtalk appears to only be visible when I log in, but it seems to say 9 views. I can’t blame NVIDIA for not letting noobs like me post stuff visible to the world. :slight_smile:
I think the 1MB per epoch might be some form of “overhead” from however malloc happens; it’s the 10MB per epoch I’d like to get rid of…

If it is more convenient for you, I can send you the RNN_example.cu that I believe shows the same effect.

Best regards

Thomas

The leak was finally fixed by reinstalling pytorch using pip.

Hi XingxingZhang,

“My” memory leak was in cudnn 6 but not 5; it did not really depend on the (py)torch version, only on which cudnn version the package happened to use.

Best regards

Thomas

I’ve been having exactly the same problem. Whenever I run a GRU/LSTM/RNN on the GPU (this does not seem to happen on the CPU), the RAM consumption keeps increasing, roughly 1MB every 10s. Even a simple model with just the RNN has this problem.

Calling the garbage collector didn’t do anything, but @apaszke’s suggestion from above (disabling the cudnn backend) solved the problem for me:
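
torch.backends.cudnn.enabled = False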

So thanks @apaszke! Is this issue going to get fixed in the next update? The workaround above is cool, but it is still a workaround. Have you been able to track down the problem?
