How to diagnose a possible CPU memory leak?

The CPU memory usage just keeps increasing while my program runs. Sorry, the code is for internal use so I can't paste it. Here is some information:

  1. My program runs in inference mode inside a torch.no_grad() context, so the growth can't come from the computation graph being retained.
  2. The neural networks are small nn.LSTM and nn.Linear models. When I set the batch size to a small value (like 4 or 8), the memory usage is stable, but when I increase the batch size (32, 128, …), the memory usage grows over the iterations.
  3. I use psutil to measure the memory usage. I also tried tracemalloc and the PyTorch profiler (see the snapshot-diff sketch after this list), but the latter two tools couldn't tell me where the leak lies.
  4. My program can run on either CPU or GPU. The possible memory leak only occurs when running on the CPU.
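
For reference, this is roughly how I compared tracemalloc snapshots between iterations (a minimal sketch; my_function and testloader stand in for the internal code, and the snapshot interval is arbitrary):

    import tracemalloc

    import torch

    tracemalloc.start()
    baseline = tracemalloc.take_snapshot()

    with torch.no_grad():
        for i, minibatch in enumerate(testloader):
            my_function(minibatch)
            if i % 50 == 0:
                snapshot = tracemalloc.take_snapshot()
                # Print the source lines whose Python-level allocations grew the most
                for stat in snapshot.compare_to(baseline, "lineno")[:10]:
                    print(stat)
                baseline = snapshot

As far as I understand, tracemalloc only tracks allocations made through Python's memory allocator, so memory allocated internally by PyTorch tensors may not show up there, which might be why it doesn't reveal the growth.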

The iteration loop looks like this:


    import os

    import psutil
    import torch

    with torch.no_grad():
        for minibatch in testloader:
            my_function(minibatch)
            # Resident set size (in bytes) of the current process
            mt = psutil.Process(os.getpid()).memory_info().rss
            print(mt)

I know that without the detailed code it's hard to find the cause. Any suggestion is welcome.

As general debugging advice, I would try removing utilities (e.g. checkpointing, quantization, etc.) if they are used and check whether the effect is still visible. Once you have narrowed it down to a small subset of your use case, you might be able to post a proxy workload which we could use to help isolate the issue.
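
If it helps, a proxy workload could be as simple as the sketch below: a small nn.LSTM followed by nn.Linear fed with random data, printing the process RSS every few iterations. The layer sizes, sequence length, and batch size here are made-up placeholders, not your actual model:

    import os

    import psutil
    import torch
    import torch.nn as nn

    # Placeholder model roughly matching the described setup (LSTM + Linear)
    lstm = nn.LSTM(input_size=64, hidden_size=128, batch_first=True)
    fc = nn.Linear(128, 10)
    lstm.eval()
    fc.eval()

    proc = psutil.Process(os.getpid())
    batch_size = 128  # the size at which the growth was reportedly observed

    with torch.no_grad():
        for i in range(1000):
            # Random stand-in for a minibatch: (batch, seq_len, features)
            x = torch.randn(batch_size, 50, 64)
            out, _ = lstm(x)
            out = fc(out[:, -1])
            if i % 100 == 0:
                print(i, proc.memory_info().rss)

If the RSS keeps growing with such a stripped-down loop, it would point away from your surrounding code; if it stays flat, you could add your utilities back one by one until the growth reappears.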