Excessive CUDA memory use when calling backward() on a new tensor

Hi everyone, this is my first post here, so please don't be too mean.
I am struggling with a weird problem where my training loop consumes roughly double the CUDA memory I expected.

Here is a quick snippet to replicate the problem:

from torchvision.models import resnet50
import torch


# This loss consumes a lot of memory 
class ZeroLoss(torch.nn.Module):
    def forward(self, embeddings):
        return torch.tensor([0.], requires_grad=True)


# This doesn't
class MeanLoss(torch.nn.Module):
    def forward(self, embeddings):
        return embeddings.mean() * 0.0


use_zeroloss = True

device = "cuda:0" if torch.cuda.is_available() else "cpu"

model = resnet50().to(device)
x = torch.randn(128, 3, 224, 224, device=device)

optimizer = torch.optim.SGD(model.parameters(), lr=1e-5, momentum=0)

# MeanLoss uses ~15 GB of VRAM; ZeroLoss uses ~24 GB
criterion = ZeroLoss() if use_zeroloss else MeanLoss()

while True:
    embeddings = model(x)

    loss = criterion(embeddings)
    loss.backward()

    optimizer.step()
    optimizer.zero_grad()

This is the minimum amount of code to replicate my problem.
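For reference, here is a minimal sketch of how the numbers below can be observed. This is not my original measurement code, just a sketch; it assumes the model, x, criterion, and optimizer defined above:

# Run a few iterations and print the peak CUDA memory allocated so far.
for _ in range(3):
    embeddings = model(x)
    loss = criterion(embeddings)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    peak_gb = torch.cuda.max_memory_allocated(device) / 1024**3
    print(f"peak allocated: {peak_gb:.2f} GB")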

[Figure: CUDA memory used by the above script with use_zeroloss = False (top) and use_zeroloss = True (bottom)]

I expected ZeroLoss to save more memory than MeanLoss, but it's actually the opposite, and the gap is huge. Why?!

Your ZeroLoss creates a new tensor that is not attached to the computation graph. Calling backward() on it will not compute any gradients for the model's parameters; it will therefore also not clear the computation graph and will not free the intermediate forward activations stored for the gradient computation. Since the next iteration builds a second graph while the previous one is still kept alive through embeddings, this explains the roughly doubled memory usage. Try to del embeddings before the next forward pass and the memory usage should go down.
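A quick sketch to confirm this, reusing the classes and variables from the snippet above (it assumes the same model, x, and device):

# ZeroLoss returns a fresh leaf tensor: it has no grad_fn, so backward()
# never reaches the graph built during the forward pass.
embeddings = model(x)
print(ZeroLoss()(embeddings).grad_fn)  # None -> detached from the graph
print(MeanLoss()(embeddings).grad_fn)  # <MulBackward0 ...> -> attached

# With ZeroLoss the graph stays alive through embeddings; dropping the last
# reference frees the graph and the activations it stored.
loss = ZeroLoss()(embeddings)
loss.backward()
print(torch.cuda.memory_allocated(device))  # activations still allocated
del embeddings, loss
print(torch.cuda.memory_allocated(device))  # drops once the graph is freed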
