Tips/Tricks on finding CPU memory leaks

Hi All,

I was wondering if there are any tips or tricks for tracking down CPU memory leaks? I’m currently running a model, and every epoch the RAM usage (as calculated via psutil.Process(os.getpid()).memory_info()[0] / (2.**30)) increases by about 0.2GB on average, and I’m really not sure where this leak is coming from.
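
For reference, this is roughly how I’m measuring the RAM usage at the end of each epoch (a simplified sketch, not my actual training loop):

```python
import os
import psutil

def rss_gb():
    # Resident set size (RSS) of the current process, in GiB
    return psutil.Process(os.getpid()).memory_info()[0] / (2. ** 30)

# called once at the end of every epoch
print(f"RSS: {rss_gb():.2f} GiB")
```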

The only debugging idea that comes to mind for me is enabling torch.autograd.set_detect_anomaly(True) at the top of my script.

Any help would be appreciated!

You could try using e.g. valgrind to find memory leaks in an application.
Note, however, that this would find real “leaks”, while users often also call an increase in memory usage in PyTorch a “memory leak”. Usually it’s not a real leak, but expected behaviour caused by incorrect usage in the code, e.g. storing a tensor with its complete computation graph in a container (e.g. a list). Tools that find real leaks won’t help in that case, and I’m unsure whether you are trying to debug a real leak or increased memory usage caused by your script.
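
As an illustration of the latter (a generic sketch, not taken from your code), a common pattern that looks like a leak but isn’t one:

```python
import torch

model = torch.nn.Linear(10, 1)
losses = []

for step in range(1000):
    x = torch.randn(32, 10)
    loss = (model(x) ** 2).mean()

    # Appending the tensor keeps the whole computation graph of every
    # iteration alive, so the host memory usage grows steadily:
    losses.append(loss)

    # Appending only the Python float would release the graph:
    # losses.append(loss.item())
```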

Hi @ptrblck, thanks for the quick response! My CPU memory leak here is quite weird to explain; I did briefly expand on it in a new, more specific topic (CPU Memory leak but only when running on specific machine), which I could merge with this one if that makes more sense.

But in short, when I run my code on one machine (let’s say machine B) the memory usage slowly increases by around 200MB to 400MB per epoch; however, running the same code on a different machine (machine A) doesn’t result in a memory leak at all. I’ve just rechecked this to make sure I’m not doing something stupid: I’ve run the same code (extracted from the same .tar.gz file) on two different machines, and one machine leaks memory while the other doesn’t. I’ve attached a graph to give a more visual comparison: [graph: memory_usage_comparison]

Given both machines are running the same scripts, the only differences I can think of are the environments and perhaps the OS. Machine A is running Ubuntu 18.04 in a conda environment (with PyTorch 1.7.1 and CUDA 11), while Machine B is running Ubuntu 20.04 in a pip environment (with PyTorch 1.8.1+cpu). I did try running Machine B with a conda environment, but I got a similar memory leak. I’m assuming it is an actual leak and not something within my script that stores a computation graph in a container (like you said), but if that were the case, surely it’d happen on both machines?

Thank you for the help! :slight_smile:

I would start the debugging by installing the same PyTorch version on both machines (e.g. 1.8.1), as I think this difference is the most likely root cause. Of course the OS and the different machines might also cause (unknown) issues, but let’s try to narrow it down by using the same software stack first.

Do you think running the code from a pip environment on one machine and a conda environment on the other might be an issue as well? I think there’s a way to save a conda environment and transfer it between machines. While I wait for your response on that, I’ll upgrade both machines to 1.8.1 (pip/conda) and rerun the scripts! Thank you!

Yes, it could also be an issue with a specific release in either conda, pip or both.
Note that you can also create new conda environments and install the 1.8.1 release there, which would keep your current environments (if that’s needed).
Nevertheless, I would generally recommend updating to the latest stable release to get the most features and bug fixes.

Hi @ptrblck! So, I’ve updated both machines to run 1.8.1 (both within a conda environment!). The only differences now are the OS and the CUDA installation: Machine A has CUDA 11.0 installed, whereas Machine B has the CPU-only install. I don’t think that would cause an issue? I instantiate all my classes at the beginning of the code and fix them to a device there, and only there, so I don’t think there’s a problem with doing net = net.to('cpu'). The leak still exists, so it seems less like a difference in PyTorch and more like something else I’m missing.

Edit: I’ve also been tracking the number of objects within the script via gc.get_objects(); the number of objects increases after the 1st epoch but stays constant from the 2nd epoch onward. This behaviour is the same on both machines, although the actual number of objects is slightly lower on the machine that has the leak.
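
This is roughly how I’m counting the objects each epoch (simplified; the grouping by type is just something I added to make differences easier to spot):

```python
import gc
from collections import Counter

def tracked_objects():
    # Objects currently tracked by the garbage collector, grouped by type name
    return Counter(type(obj).__name__ for obj in gc.get_objects())

# logged at the end of every epoch; the total is what I've been comparing,
# and diffing the per-type counts between epochs shows what (if anything) grows
counts = tracked_objects()
print(sum(counts.values()))
```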

Since the memory leak seems to affect the host memory (if I understand the issue correctly), I don’t think the installed CUDA toolkit version matters here.

Yes, that is concerning.

Could you post the machine details (if possible) of the “leaking” machine?
Are you running the workload in a container or on bare metal?
Would it also be possible to get an executable code snippet to debug this issue?
Were you able to check for memory leaks with valgrind?

When you say “machine details” do you mean like hardware? Or software?

Do you mean like a Docker container? If so, no, I’m just running this on a desktop with the conda environment stated previously, straight from the command line with python3 main.py.

I’m trying to think of the best way to get a code snippet working so you can reproduce the error. The only issue is that the code is spread over a few files and brought together in a main script. While waiting for your response, I went through the code and rewrote it (from scratch) to remove anything I could see being a potential memory issue, and the ‘new’ version still has the memory leak, so I’m still a little confused!

To add to the confusion, I played around with the size of my network and the batch size to see if that affected the memory leak, and interestingly enough I’ve found one thing that may correspond to it. To give a brief overview, my network is a feed-forward network that takes N inputs and returns 2 outputs (sign and logabsdet from torch.slogdet), which are used to calculate a scalar loss value that I subsequently minimize. I’ve noticed that with N=8 (for the network input) the memory leak seems to go away and the memory usage just fluctuates around 0.5GB, but with, say, N=12 the leak is present and the usage increases by around 0.1GB per epoch.
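
To give a rough idea of the setup, here’s a heavily simplified sketch (the layer sizes, loss, and names are made up for illustration and are not my actual code):

```python
import torch
import torch.nn as nn

N = 12  # the leak shows up with N=12 but not with N=8

class Net(nn.Module):
    def __init__(self, n):
        super().__init__()
        self.n = n
        self.layers = nn.Sequential(
            nn.Linear(n, 64), nn.Tanh(),
            nn.Linear(64, n * n),
        )

    def forward(self, x):
        # reshape the output into an (n, n) matrix per sample and take its slogdet
        m = self.layers(x).view(-1, self.n, self.n)
        sign, logabsdet = torch.slogdet(m)
        return sign, logabsdet

net = Net(N).to('cpu')
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

# one illustrative optimization step
x = torch.randn(256, N)
sign, logabsdet = net(x)
loss = -logabsdet.mean()  # scalar loss built from the slogdet output
opt.zero_grad()
loss.backward()
opt.step()
```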

I haven’t been able to check with valgrind yet; I’ve only used valgrind once, and that was to debug some Fortran 95 code, so I’m not 100% sure how it would interface with an interpreted language like Python. I can have a look online and see how to use it!

Thank you for the help! :slight_smile:

I think the CPU and OS in use would be a good start.

Interesting findings, which might help debug it further.

Please let us know once you can share some code.

The CPU is an Intel® Core™ i7-10700K:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 39 bits physical, 48 bits virtual
CPU(s): 16
Virtualisation: VT-x

Do you need any other specific information, or should I just post the entire lscpu output?

OS info

NAME="Ubuntu"
VERSION="20.04.2 LTS (Focal Fossa)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 20.04.2 LTS"
VERSION_ID="20.04"

I’ll get to work on writing up the code (with some comments) and I’ll post a GitHub link when it’s ready!