Memory leak when using torch.no_grad()?

Here’s my code:

for epoch in range(10):
    net.train()
    # Training loop: this part works fine.
    for data in trainloader:
        inputs, labels = data['images'], data['masks']

        for idx in range(0, len(inputs), 7):
            optimizer.zero_grad()

            outputs = net(inputs[idx:idx + 7])
            loss = criterion(outputs, labels[idx:idx + 7])
            loss.backward()
            optimizer.step()

    # Validation loop: memory grows here.
    net.eval()
    test_loss = 0.0
    test_times = 0
    for data in testloader:
        with torch.no_grad():
            inputs, labels = data['images'], data['masks']

            for idx in range(0, len(inputs), 7):
                outputs = net(inputs[idx:idx + 7])
                loss = criterion(outputs, labels[idx:idx + 7])
                test_loss += loss.item()
                test_times += 1
    test_loss /= test_times

If I use the torch.no_grad() block, CPU memory continually increases until an OOM kill happens.
But once I remove the no_grad, everything is fine.
I tried del loss and putting the validation into a function, but the memory leak still happens.
Is my code wrong?

Hi,

The code looks OK, and I'm really not sure how no_grad could cause this.

Do you have a small code sample (30/40 lines) that reproduces this?
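
Even something rough along these lines would be enough; this is just a sketch of the kind of script I mean (psutil is one way to read the process RSS, and the tiny model, tensor shapes, and chunk size of 7 are placeholders that mirror your loop):

import torch
import torch.nn as nn
import psutil

# Tiny stand-in model and loss; only the loop structure matters here.
net = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(), nn.Conv2d(8, 1, 3, padding=1))
criterion = nn.MSELoss()

# Fake "batches" shaped like the real dataloader output.
fake_loader = [
    {'images': torch.randn(28, 3, 64, 64), 'masks': torch.randn(28, 1, 64, 64)}
    for _ in range(5)
]

proc = psutil.Process()
for epoch in range(20):
    net.eval()
    for data in fake_loader:
        with torch.no_grad():
            inputs, labels = data['images'], data['masks']
            # Same "slice the batch in chunks of 7 under no_grad" pattern.
            for idx in range(0, len(inputs), 7):
                outputs = net(inputs[idx:idx + 7])
                loss = criterion(outputs, labels[idx:idx + 7])
    # Log process RSS once per epoch to see whether it keeps growing.
    print(f"epoch {epoch}: rss = {proc.memory_info().rss / 1e6:.1f} MB")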

Hello,
I will put together a code sample later.
Thanks for your reply~

Hi,
Thanks for following up. It's midnight here and the GPUs are occupied, so I'm going to test it in a few days.

One interesting thing I found is that validation on the CPU works fine under no_grad with the same code.

My complex tensor slicing might be a potential cause, but I'm confused about why the CPU RAM leaks rather than GPU memory.

I am having a similar issue to the one the OP posted. I have commented out all my training methods; the only method that remains is similar to an RL setting where I use a network to get an action for an environment. However, despite wrapping the main call in torch.no_grad(), I am still getting memory leakage (it happens on both CPU and GPU). The code is quite long, and I'm not sure I could make it more concise, but do you know what a common cause to look for would be? I'm losing my mind, as usually these memory issues are caused by computational graphs not being freed, but everything now runs without grad, so I've no idea what can cause it.
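
Would something along these lines be a reasonable way to see whether tensors are piling up between steps? It's just a rough check over what Python's gc can reach, so it would miss anything held only on the C++ side:

import gc
import torch

def count_live_tensors():
    """Count tensors reachable by Python's garbage collector."""
    n, elems = 0, 0
    for obj in gc.get_objects():
        try:
            if torch.is_tensor(obj):
                n += 1
                elems += obj.numel()
        except Exception:
            pass
    return n, elems

# Called once per environment step to see whether the counts keep growing:
# n, elems = count_live_tensors()
# print(f"live tensors: {n}, total elements: {elems}")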
