Can't find source of memory leak

Hi All,

I have a small model (2M params), and I'm using a batch size of 1. The size of every batch varies, but on average I'm using 5 GB per iteration during the first epoch.

My problem is that after the first epoch, my memory consumption keeps increasing until it hits an OOM. I suspect this is due to a memory leak, so I tried the following fixes:

  • add torch.cuda.empty_cache() after each memory-heavy operation
  • delete my mini-batch data and loss after each iteration
  • detach all logged data and move it to the CPU (sketched below)
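To make these points concrete, here is a minimal sketch of my loop. The model, data, and loop length below are just placeholders standing in for my actual graph model and data loader:

```python
import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Placeholders: my real model is a ~2M-param graph model and the inputs are
# graphs of varying size; a tiny linear model keeps the sketch runnable.
model = nn.Linear(16, 1).to(device)
optimizer = torch.optim.Adam(model.parameters())
criterion = nn.MSELoss()

loss_log = []
for step in range(100):
    x = torch.randn(1, 16, device=device)  # batch size 1
    y = torch.randn(1, 1, device=device)

    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()

    # fix 3: detach, move to CPU, and store only a Python float for logging
    loss_log.append(loss.detach().cpu().item())

    # fix 2: drop references to the mini-batch and the loss
    del x, y, loss

    # fix 1: release cached blocks after each memory-heavy operation
    torch.cuda.empty_cache()
```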

Unfortunately, none of this solved the issue!

Is there something I'm missing? Can you help me solve this issue?
Thank you in advance!

I would check whether the growth per iteration is roughly linear or something else by printing the memory PyTorch has allocated after every iteration.
If it is a leak, the growth should likely be linear.
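Something along these lines should show it (just an illustrative helper; call it at the end of every iteration):

```python
import torch

def report_memory(step):
    # allocated: memory currently held by live tensors
    # reserved: memory held by the caching allocator (includes cached, unused blocks)
    alloc = torch.cuda.memory_allocated() / 1024**2
    reserved = torch.cuda.memory_reserved() / 1024**2
    print(f"iter {step}: allocated = {alloc:.1f} MiB, reserved = {reserved:.1f} MiB")
```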

Hi @tom,
Thanks for your response.
It's not linear, but that's because the size of every mini-batch is different: I'm working with graphs, and every graph has a different number of nodes!

Well, if you have to, you could run the same graph over and over again.
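Something like this would take the varying graph sizes out of the picture; model, criterion, optimizer, and loader here are placeholders for your own objects, and the way the batch is unpacked is just an assumption:

```python
import torch

# Reuse one fixed mini-batch every iteration. With a constant input size,
# allocated memory should stay flat; if it still grows, the leak is in the
# training loop itself rather than in the varying graph sizes.
fixed_batch, fixed_target = next(iter(loader))   # placeholder: any single graph

for i in range(200):
    optimizer.zero_grad()
    loss = criterion(model(fixed_batch), fixed_target)
    loss.backward()
    optimizer.step()
    print(f"iter {i}: allocated = {torch.cuda.memory_allocated() / 1024**2:.1f} MiB")
```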