CPU memory leak, but only when running on a specific machine

Hi all,

I’m running a model and I’ve noticed that RAM usage slowly increases during training. It’s around 200 MB per epoch, but over time it fills up all the RAM on my machine, which eventually leads the OS to kill the job. The strange thing is that this only happens on one specific machine. I’ve run the same code on a different machine and there’s no memory leak whatsoever.

The difference between the two machines is that one is running PyTorch 1.7.1 with CUDA 10.2 (the machine without the memory leak) and the other (the one with the memory leak) is running PyTorch 1.8.1. Both environments are managed with conda and run only on the CPU. I have tried older versions of PyTorch on the machine with the memory leak, but the leak persists, so I doubt it’s caused by the PyTorch version.

I’ve been using psutil to monitor the RAM usage on the CPU, and I’ve been using the tracemalloc package to print out snapshots during the training loop to see how memory usage changes from one epoch to the next. Yet none of the differences reported within my own code come anywhere near the ~200 MB per-epoch RAM increase…
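
For context, the RAM figure in the log below comes from psutil. This is only a rough sketch of how I read it (the helper name current_ram_gb is just a placeholder I’m using for this post):

import psutil

# Handle to the current training process
process = psutil.Process()

def current_ram_gb():
    # Resident set size of this process, converted to GiB
    return process.memory_info().rss / 1024 ** 3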

An example of the output looks something like this…

Epoch     25/  1000 Loss 1428.8508 RAM:  9.910GB
~/Model.py:46: size=1136 B (-96 B), count=2 (+0), average=568 B
~/utils.py:14: size=1040 B (+24 B), count=8 (+1), average=130 B
~/calc_loss.py:79: size=2128 B (+0 B), count=24 (+0), average=89 B
~/calc_loss.py:78: size=2056 B (+0 B), count=23 (+0), average=89 B
~/utils.py:6: size=1920 B (+0 B), count=21 (+0), average=91 B
Epoch     26/  1000 Loss 1426.0033 RAM: 10.254GB
~/Model.py:46: size=1232 B (+96 B), count=2 (+0), average=616 B
~/utils.py:14: size=1016 B (-24 B), count=7 (-1), average=145 B
~/calc_loss.py:79: size=2128 B (+0 B), count=24 (+0), average=89 B
~/calc_loss.py:78: size=2056 B (+0 B), count=23 (+0), average=89 B
~/utils.py:6: size=1920 B (+0 B), count=21 (+0), average=91 B
~/Layers.py:71: size=992 B (+0 B), count=11 (+0), average=90 B
Epoch     27/  1000 Loss 1436.8241 RAM: 10.606GB
~/utils.py:14: size=1040 B (+24 B), count=8 (+1), average=130 B
~/calc_loss.py:79: size=2128 B (+0 B), count=24 (+0), average=89 B
~/calc_loss.py:78: size=2056 B (+0 B), count=23 (+0), average=89 B
~/utils.py:6: size=1920 B (+0 B), count=21 (+0), average=91 B
~/Model.py:46: size=1232 B (+0 B), count=2 (+0), average=616 B
~/Layers.py:71: size=992 B (+0 B), count=11 (+0), average=90 B
Epoch     28/  1000 Loss 1428.6560 RAM: 10.968GB
~/calc_loss.py:79: size=2128 B (+0 B), count=24 (+0), average=89 B
~/calc_loss.py:78: size=2056 B (+0 B), count=23 (+0), average=89 B
~/utils.py:6: size=1920 B (+0 B), count=21 (+0), average=91 B
~/Model.py:46: size=1232 B (+0 B), count=2 (+0), average=616 B
~/utils.py:14: size=1040 B (+0 B), count=8 (+0), average=130 B
Epoch     29/  1000 Loss 1435.2988 RAM: 11.321GB
~/calc_loss.py:79: size=2128 B (+0 B), count=24 (+0), average=89 B
~/calc_loss.py:78: size=2056 B (+0 B), count=23 (+0), average=89 B
~/utils.py:6: size=1920 B (+0 B), count=21 (+0), average=91 B
~/Model.py:46: size=1232 B (+0 B), count=2 (+0), average=616 B
~/utils.py:14: size=1040 B (+0 B), count=8 (+0), average=130 B
~/Layers.py:71: size=992 B (+0 B), count=11 (+0), average=90 B

The information printed between the lines showing the current epoch is produced by this function:

def compare_snaps(snap1, snap2, limit=50):
    # snap1 is the newer snapshot; compare_to reports deltas relative to snap2
    top_stats = snap1.compare_to(snap2, "lineno")

    for stat in top_stats[:limit]:
        line = str(stat)
        if "~/" in line:  # filter only lines from my own code
            print(line)

This function takes two snapshots from tracemalloc.take_snapshot(), one from the current epoch and one from the previous epoch, and reports how memory usage changed between them. It takes the top 50 most memory-intensive entries, filters out everything not written by me (i.e. it excludes any changes from within anaconda3), and prints the remaining changes to the screen. As can be seen, the changes in memory are negligible. In fact, when comparing the snapshot output from both machines, they’re nearly identical.
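
For completeness, this is roughly how the snapshots are taken and compared in the training loop (a simplified sketch: train_one_epoch, num_epochs, model, loader, and current_ram_gb are placeholders standing in for my actual code):

import tracemalloc

tracemalloc.start()
prev_snap = tracemalloc.take_snapshot()

for epoch in range(num_epochs):
    loss = train_one_epoch(model, loader)  # placeholder for my real training step
    print(f"Epoch {epoch:>5}/{num_epochs:>6} Loss {loss:.4f} RAM: {current_ram_gb():.3f}GB")

    snap = tracemalloc.take_snapshot()
    compare_snaps(snap, prev_snap)  # newer snapshot first, older second
    prev_snap = snap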

It seems really weird that the same PyTorch code would have a memory leak on one machine and not on another… This memory-leak issue seems to be beyond my skill set, so any help would be appreciated.

Thank you! :slight_smile: