In-place operation causes a memory leak. Why?

I have the following piece of code:

# Shifted inverse iteration: refine eigenvector estimates for the approximate
# eigenvalues in eigval_approximations, normalizing each column after every step.
eigvecs = torch.randn(n, b, dtype=eigval_approximations.dtype, device=eigval_approximations.device)
eigvecs /= torch.linalg.norm(eigvecs, dim=0)  # in-place column normalization

for _ in range(iterations):
    eigvecs = torch.linalg.solve(mats - eigval_approximations * identity, eigvecs.T).T
    eigvecs /= torch.linalg.norm(eigvecs, dim=0)  # in-place column normalization

It is part of a loss calculation in PyTorch, and it causes a GPU OOM error after a few epochs.

eigvecs is re-initialized at every training/validation step, it has the same size at each step, and it lives in a very short-lived scope.

Rewriting this piece of code to avoid the in-place division (/=) eliminates the memory leak.
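For reference, the non-leaking variant looks roughly like this (a sketch using the same surrounding variables; the only change is allocating a new tensor instead of dividing in-place):

for _ in range(iterations):
    eigvecs = torch.linalg.solve(mats - eigval_approximations * identity, eigvecs.T).T
    eigvecs = eigvecs / torch.linalg.norm(eigvecs, dim=0)  # out-of-place division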

Why was there a memory leak due to an in-place operation in the first place? I don’t see any reason for this.

Is this expected Python behaviour, or might it be an implementation detail of PyTorch (or a bug in it)?

Does anyone have an idea?

Hi @Amos_Haviv_Hason,

I’m not quite sure, so it’d be best to get a dev’s opinion on this, but I do know that torch.linalg.solve syncs with the CPU when it’s run on the GPU (so perhaps the memory leak comes from that?).

The documentation of torch.linalg.solve states:

When inputs are on a CUDA device, this function synchronizes that device with the CPU. For a version of this function that does not synchronize, see torch.linalg.solve_ex()

So, you could try replacing torch.linalg.solve with torch.linalg.solve_ex (docs here: torch.linalg.solve_ex — PyTorch 2.4 documentation) and see if you get the same memory leak. That would be one way to test this hypothesis (although not conclusively).
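As a rough sketch (reusing the variable names from your snippet), the swap inside the loop would look something like:

for _ in range(iterations):
    # solve_ex returns a named tuple (result, info) and skips the CPU sync;
    # info holds the LAPACK error codes and is not checked by default
    result, info = torch.linalg.solve_ex(mats - eigval_approximations * identity, eigvecs.T)
    eigvecs = result.T
    eigvecs /= torch.linalg.norm(eigvecs, dim=0)

Keeping the in-place division here isolates the solve call as the variable being tested.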

Also, if you’re using in-place operations to speed up PyTorch, they’ll make minimal difference in this use case. And if you’re purely computing the eigenvalues (and don’t want any gradients), run your code within a torch.no_grad() context manager, which will speed up your code. Docs here: no_grad — PyTorch 2.4 documentation
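A minimal sketch of that, assuming you really don’t need gradients to flow through this block (you mentioned it’s part of a loss calculation, so check that first):

with torch.no_grad():
    # no autograd graph is recorded for anything inside this block
    eigvecs = torch.randn(n, b, dtype=eigval_approximations.dtype, device=eigval_approximations.device)
    eigvecs /= torch.linalg.norm(eigvecs, dim=0)
    for _ in range(iterations):
        eigvecs = torch.linalg.solve(mats - eigval_approximations * identity, eigvecs.T).T
        eigvecs /= torch.linalg.norm(eigvecs, dim=0)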