How to save memory by not tracking some variables during training?

I am trying to save some memory during training to avoid OOM.
Here is the original code block:

entropy = torch.sum(-(pred1_softmax + 1e-9).log() * pred1_softmax, dim=1)   # pred1_softmax size: [1, 192, 384, 1248]

Running the line above causes an increase of 700+ MB in memory occupancy. I guess this is because PyTorch saves -(pred1_softmax + 1e-9).log(), a big tensor with the same size as pred1_softmax, in order to compute gradients in the backward pass. So I want to wrap the block in torch.no_grad(). Here is the modified code:

with torch.no_grad():
    entropy = torch.sum(-(pred1_softmax + 1e-9).log() * pred1_softmax, dim=1)

However, it doesn't work: it still occupies more than 700 MB of memory, and I am confused.
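A minimal way to reproduce the measurement, with a random tensor standing in for the real pred1_softmax (just a sketch, not my actual training code):

import torch

# stand-in for the real softmax output; requires_grad so autograd builds a graph
pred1_softmax = torch.rand(1, 192, 384, 1248, device='cuda', requires_grad=True)

before = torch.cuda.memory_allocated()
entropy = torch.sum(-(pred1_softmax + 1e-9).log() * pred1_softmax, dim=1)
after = torch.cuda.memory_allocated()

# autograd keeps (pred1_softmax + 1e-9) and -(pred1_softmax + 1e-9).log() alive
# for the backward pass, so the increase is roughly two full-size tensors
print((after - before) / 1024 / 1024, "MB")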

Any help would be appreciated.

Using torch.no_grad() only disables gradient tracking, but entropy is still on the GPU. You can use the following code to profile your GPU memory usage.

import torch as t

data = t.rand(1, 192, 384, 1248)
print(t.cuda.memory_reserved(0))   # baseline: the tensor is still on the CPU
data = data.cuda()
print(t.cuda.memory_reserved(0))   # the ~350 MB tensor is now on the GPU
data = t.softmax(data, -1)
print(t.cuda.memory_reserved(0))
data.requires_grad = True
print(t.cuda.memory_reserved(0))
# uncomment these lines (instead of the no_grad block below) to compare with
# the case where autograd tracks the computation:
# entropy = t.sum(-(data + 1e-9).log() * data, dim=1)
# print(t.cuda.memory_reserved(0))
# del entropy
# print(t.cuda.memory_reserved(0))
with t.no_grad():
    entropy = t.sum(-(data + 1e-9).log() * data, dim=1)
    print(t.cuda.memory_reserved(0))
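One note on the two counters: torch.cuda.memory_reserved() reports everything held by PyTorch's caching allocator, while torch.cuda.memory_allocated() reports only the memory backing live tensors. Printing both, as in this sketch, makes it easier to tell whether the extra 700 MB is live tensors or just cached blocks:

import torch as t

data = t.rand(1, 192, 384, 1248, device='cuda')
data.requires_grad = True

with t.no_grad():
    entropy = t.sum(-(data + 1e-9).log() * data, dim=1)

# under no_grad the big intermediates are temporaries: they are freed right after
# the line, so allocated only grows by the size of entropy, while the caching
# allocator keeps the freed blocks and reserved stays high
print(t.cuda.memory_allocated() / 1024 / 1024, "MB allocated")
print(t.cuda.memory_reserved() / 1024 / 1024, "MB reserved")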

Thank you for your reply!
I know entropy is still on the GPU; what I want to do is eliminate the intermediate tensor -(pred1_softmax + 1e-9).log(), because it is much larger than entropy itself.

# data stands for the intermediate tensor of shape [1, 192, 384, 1248]
# entropy is the tensor of shape [1, 384, 1248]

>>> data = torch.rand([1, 192, 384, 1248], device='cuda')
>>> torch.cuda.memory_allocated() / 1024 / 1024    # convert to MB
352

>>> entropy = torch.rand([1, 384, 1248], device='cuda')
>>> torch.cuda.memory_allocated() / 1024 / 1024
1.828125
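For what it's worth, one way to keep even the temporary small would be to compute the entropy in slices along the last dimension, so that -(p + 1e-9).log() never covers the full [1, 192, 384, 1248] tensor at once. A rough, untested sketch (the helper name and chunk size are made up):

import torch

# stand-in for the real softmax output from the original code
pred1_softmax = torch.rand(1, 192, 384, 1248, device='cuda')

def entropy_chunked(p, eps=1e-9, chunk=64):
    # process the last dimension in slices so the -(p_slice + eps).log()
    # temporary only ever covers `chunk` columns; this mainly limits the
    # transient peak (e.g. inside torch.no_grad()), since with grad enabled
    # autograd still saves per-slice intermediates that add up to full size
    outs = []
    for p_slice in p.split(chunk, dim=-1):
        outs.append(torch.sum(-(p_slice + eps).log() * p_slice, dim=1))
    return torch.cat(outs, dim=-1)

with torch.no_grad():
    entropy = entropy_chunked(pred1_softmax)   # shape [1, 384, 1248]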

Again, I really appreciate your help! Your answer inspired me.
torch.no_grad() actually works: the intermediate tensor is no longer on the GPU. However, the memory occupancy still increases a lot, and I find that part of that memory is just cached by PyTorch's allocator.

And using

torch.cuda.empty_cache()

can release the memory.
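For completeness, torch.cuda.empty_cache() only hands cached, unused blocks back to the CUDA driver; tensors that are still alive are untouched, so memory_allocated() stays the same while memory_reserved() drops. A quick check, run right after the entropy computation above:

import torch

print(torch.cuda.memory_allocated() / 1024 / 1024, "MB allocated")   # live tensors only
print(torch.cuda.memory_reserved() / 1024 / 1024, "MB reserved")     # includes cached blocks

torch.cuda.empty_cache()   # return cached, unused blocks to the driver

print(torch.cuda.memory_allocated() / 1024 / 1024, "MB allocated")   # unchanged
print(torch.cuda.memory_reserved() / 1024 / 1024, "MB reserved")     # should drop

The cached blocks would have been reused by later allocations anyway, so empty_cache() mainly matters when other processes need that GPU memory.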