Debugging GPU memory leaks can be tricky under a debugger

While debugging a program with a memory leak, I discovered that the leak appeared much bigger when I ran the program under the PyCharm debugger. I haven’t compared this to other debuggers, but GPU memory consumption was definitely much larger under the debugger.

I tried a whole bunch of debugger settings, including “Variable loading policy: On demand”, but none of them made a difference.

(Note: this post has been edited to add this clarification. I originally blamed the PyCharm debugger specifically, but as @googlebot pointed out in the comments, the same could just as well happen with any other debugger.)

The original post follows:


Is anybody using the PyCharm debugger with PyTorch programs? It leaks GPU memory like there is no tomorrow. This sounds similar to the problem of the traceback on OOM in IPython not freeing GPU memory.

Unknowingly, I made the mistake of trying to debug a memory leak using PyCharm, not realizing that the PyCharm debugger itself holds on to tensors and doesn’t free them, even with a forced gc.collect(). I suppose this is by design: PyCharm keeps references to all the variables so the user can inspect them, and therefore they can’t be freed until their frames are exited.

I discovered this while writing a script to reproduce a memory leak: only once I added GPU memory tracing to it did I notice that I was getting totally different measurements when running the same script under the debugger and outside of it.
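For illustration, the tracing was nothing fancy, roughly something like this (a minimal sketch; the function names are mine, and the exact numbers depend on the allocator):

```python
import gc
import torch

def report(tag):
    # memory_allocated() counts memory held by live tensors on the current GPU
    print(f"{tag}: {torch.cuda.memory_allocated() / 2**20:.1f} MB")

def work():
    x = torch.randn(1024, 1024, device="cuda")  # ~4 MB tensor
    report("inside work()")
    return x.sum().item()

report("before")
work()
gc.collect()  # force collection of anything that is no longer referenced
report("after gc.collect()")
# Run normally, "after gc.collect()" comes back to the "before" level.
# Under a debugger that keeps references to frame locals, it may not.
```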

I tried a whole bunch of settings, but none seemed to help. If you know of a way to tell PyCharm not to store intermediary variables, please share it, but somehow I doubt it’s even possible.

So if you’re trying to debug a PyTorch program under PyCharm and you end up getting an OOM, this is why.

  1. Use “Variable loading policy” = on demand.
  2. IPython’s history may keep tensors alive (underscore variables like _, _10). In practice, I almost never have issues caused by that, but maybe your console usage patterns are different. So, disabling IPython may do something (I don’t know how to disable history only).
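For example, here is roughly how those history variables pin a tensor and one way to drop them. This is a sketch for an IPython session; %reset is an IPython magic (not plain Python), and the exact names (_, _2, …) depend on your session:

```python
# In an IPython session (including PyCharm's Python Console):
import gc
import torch

torch.randn(1024, 1024, device="cuda")   # the cell's result is cached as _, _2, Out[2], ...

# The tensor is still reachable through IPython's output history, so
# gc.collect() cannot free its CUDA memory. Dropping the history helps:
%reset -f out                            # IPython magic: clear the output cache
gc.collect()
torch.cuda.empty_cache()
print(torch.cuda.memory_allocated())     # should be back near zero
```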

If you don’t pause or use breakpoints, I don’t see how PyCharm would allocate CUDA memory.

Thank you for your follow-up, @googlebot.

Use “Variable loading policy” = on demand.

As I mentioned, I tried many different options, including this one, to no avail.

IPython’s history may keep tensors alive (underscore variables like _, _10)

No, this has to do with IPython not releasing GPU RAM on OOM, which is a huge problem for Jupyter users. A fix was proposed almost two years ago, but it has never been integrated:

I use the 1/0 cell fix, run right after the cell that OOMed, to work around it.
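For anyone who hasn’t seen that trick, it looks roughly like this (a sketch; model and batch are placeholders, and the idea is that the fresh exception replaces the stored OOM traceback that still references the CUDA tensors):

```python
# Cell 1: hits the OOM; IPython keeps the exception's traceback around,
# and through it every frame local, including the big CUDA tensors.
out = model(batch)        # RuntimeError: CUDA out of memory

# Cell 2: raise any cheap exception so the stored traceback is replaced.
1/0                       # ZeroDivisionError

# Cell 3: now the tensors are unreachable and the memory can be reclaimed.
import gc, torch
gc.collect()
torch.cuda.empty_cache()
```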

If you don’t pause or use breakpoints, I don’t see how PyCharm would allocate CUDA memory.

Right, so basically you’re saying: do not use the PyCharm debugger.

It’s not that the debugger allocates CUDA memory itself; it prevents variables from being garbage-collected, even with gc.collect(), and thus prevents their memory from being freed.
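To make the mechanism concrete, here is a rough simulation of what keeping references to frame locals does; the held list stands in for whatever the debugger retains, and the names are illustrative:

```python
import gc
import torch

held = []  # stands in for the references a debugger keeps to frame locals

def work(pin):
    x = torch.randn(4096, 4096, device="cuda")  # ~64 MB
    if pin:
        held.append(x)   # simulate the debugger holding on to the local
    return x.mean().item()

for pin in (False, True):
    work(pin)
    gc.collect()
    mb = torch.cuda.memory_allocated() / 2**20
    print(f"pin={pin}: {mb:.0f} MB still allocated")
# Without the extra reference the ~64 MB goes back to PyTorch's pool;
# with it, gc.collect() can't help because the tensor is still reachable.
```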

Your message sounded like the debugger is somehow defective, which is not the case in my experience. IPython/Jupyter’s leaks would happen regardless of PyCharm or any other debugger/IDE, wouldn’t they?


You’re correct that I made a broad statement without comparing it to other debuggers. I appreciate you flagging that, @googlebot. I’ve edited the first post to reflect it.

While there is definitely a similarity, I don’t think we can compare this to Jupyter/IPython. In IPython you can control which variables remain in scope, whereas debuggers do their own tracking that you can’t control.

Unfortunately, I didn’t save the particular code that showed drastically different GPU memory usage with and without the debugger. I tried a few approaches now, and the only correlation I found is with the number of breakpoints in frames that had huge variables on CUDA: the more breakpoints, the more extra memory was used.

I may get a chance to investigate this further, and if I do I will report back. But I’m surely going to be wary of measuring memory usage under a debugger from now on, and will double-check the usage patterns outside of the debugger.