Help with pytorch "memory leak" on CPU

Hi,

I’m currently developing a differentiable physics engine using PyTorch (2.1.0) that combines physics equations and machine learning. I’m seeing what looks like a memory leak (running on CPU, haven’t tried the GPU) where memory usage keeps increasing epoch after epoch. It doesn’t appear to be a plain Python memory issue, but rather a computational graph / gradient leak where tensors aren’t being released after I call backward(). While I search for exactly where the leak occurs, I tried this naive workaround:

import gc

for _ in range(10):
    trainer = TrainingEngine()   # rebuild everything from scratch each iteration
    trainer.train(epoch=1)
    del trainer
    gc.collect()

But this code still results in memory that increases continually. Can someone explain why the memory held by the computational graph is not released even after I delete the top-level object that contains all the tensors? I know that ideally I should find the exact place where tensors are created but never released, but is there a simple way to just tell PyTorch to clear the computational graph at the end of each epoch? Thanks!

Unfortunately not, if you have created references that keep the computation graph alive. Check whether you are appending any loss or model output to e.g. a list somewhere.
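
For example, a pattern like this (a minimal made-up sketch, not your actual code) keeps a reference to every epoch’s loss tensor and its grad_fn graph:

import torch

model = torch.nn.Linear(3, 1)
loss_fn = torch.nn.MSELoss()
history = []

for epoch in range(100):
    out = model(torch.randn(8, 3))
    loss = loss_fn(out, torch.zeros(8, 1))
    loss.backward()

    # history.append(loss)        # keeps the loss tensor plus its grad_fn graph reachable every epoch
    history.append(loss.item())   # a plain Python float: the graph can be garbage collected

The same applies to model outputs or intermediate states: if you only need the values for logging, store output.detach() (or .item() for scalars).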

I also noticed some weird behavior.

Opening up the train() method a little bit:

def train(self):
    curr_states = torch.zeros(n, 3)
    curr_state = ...
    for i in range(n):
        curr_state = self.step(curr_state)
        curr_states[i] = curr_state
    loss = loss_fn(curr_states)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

I noticed that when n=1, memory doesn’t increase no matter how long I run it, but when n > 1, that’s when the memory starts growing. Does this info help?

Is curr_states used outside of this method or stored somewhere else? Does the memory decrease again if you delete it?
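
If the state is carried over between epochs (or stored on the object) without detaching, that alone can make the graph grow. This toy example is purely hypothetical (the ToyEngine class and its step() are made up to illustrate the pattern, I don’t know what your step() does):

import torch

class ToyEngine:
    def __init__(self, n=5):
        self.n = n
        self.w = torch.randn(3, requires_grad=True)
        self.state = torch.zeros(3)   # persists between epochs

    def step(self, state):
        return state + self.w         # some differentiable update

    def train_epoch(self):
        curr_states = torch.zeros(self.n, 3)
        curr_state = self.state
        for i in range(self.n):
            curr_state = self.step(curr_state)
            curr_states[i] = curr_state
        loss = curr_states.pow(2).sum()
        loss.backward()
        with torch.no_grad():
            self.w -= 0.01 * self.w.grad
            self.w.grad = None

        # self.state = curr_state            # chains next epoch's graph onto this one, so history grows forever
        self.state = curr_state.detach()     # cuts the history here, so earlier epochs' graphs can be freed

With the detach() the memory stays flat; without it, every epoch’s graph stays reachable through the stored state (and in a real model you would often also hit a “trying to backward through the graph a second time” error). The same reasoning applies if curr_states itself is kept in a list or attribute somewhere outside this method.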