Tips/Tricks on finding CPU memory leaks

I ran into the same problem: memory grows after every training batch. (I haven't tried it on GPU.)
I found the reason: I had a global variable, loss_sum, used to accumulate the loss and periodically print a historical average, like “loss_sum += batch_avg_loss”. This keeps a reference to each batch's loss tensor, so I suspect PyTorch retains the gradient-related computation graphs attached to loss_sum in memory. To cut the gradient dependency, I changed the code to “loss_sum += batch_avg_loss.detach().numpy()” so the accumulation happens outside the autograd graph, and the memory growth went away.
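To make the pattern concrete, here is a minimal sketch. The names loss_sum and batch_avg_loss come from my code above; the model, data, and optimizer are just made-up placeholders for illustration, and I use .item() in the sketch, which does the same thing as .detach().numpy() for a scalar loss:

```python
import torch

# Minimal illustration of the leak and the fix; model/data are placeholders.
model = torch.nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

loss_sum = 0.0  # global running total used for periodic logging

for step in range(100):
    x = torch.randn(32, 10)
    y = torch.randn(32, 1)

    batch_avg_loss = torch.nn.functional.mse_loss(model(x), y)

    optimizer.zero_grad()
    batch_avg_loss.backward()
    optimizer.step()

    # Leaky version: each += keeps that batch's autograd graph alive,
    # so memory grows every iteration.
    # loss_sum += batch_avg_loss

    # Fixed version: convert to a plain Python float so no graph is retained.
    # .item() is the usual idiom; .detach().numpy() achieves the same thing.
    loss_sum += batch_avg_loss.item()

    if (step + 1) % 10 == 0:
        print(f"average loss so far: {loss_sum / (step + 1):.4f}")
```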
The way I found the problem is quite naive: I gradually commented out lines and re-ran the program until the memory stopped growing, which eventually let me pinpoint the offending code.
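If it saves someone time, here is a tiny sketch of how you could watch process memory while doing that. It assumes the third-party psutil package is installed, and the training loop shown in comments is only a placeholder:

```python
import os
import psutil  # third-party package, assumed installed (pip install psutil)

def rss_mb():
    """Resident memory of the current process, in MB."""
    return psutil.Process(os.getpid()).memory_info().rss / 1024 ** 2

# Print memory every 50 batches; comment out suspect lines between runs
# and check whether the number still climbs.
# for step, batch in enumerate(loader):
#     ...  # training step
#     if step % 50 == 0:
#         print(f"step {step}: {rss_mb():.1f} MB")
```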
Hope it helps.