Running the line above increases GPU memory usage by more than 700 MB. My guess is that PyTorch saves -(pred1_softmax + 1e-9).log(), a large tensor the same size as pred1_softmax, so that it can compute gradients in the backward pass. So I want to wrap the line in torch.no_grad(). Here is the modified code:

with torch.no_grad():
    entropy = torch.sum(-(pred1_softmax + 1e-9).log() * pred1_softmax, dim=1)

However, it doesn't work. It still occupies more than 700 MB of memory, and I am confused.

Thank you for your reply!
I know entropy is still on the GPU. What I want to do is eliminate the intermediate tensor -(pred1_softmax + 1e-9).log(), because it is much larger than entropy itself.

# data stands in for the intermediate tensor of shape [1, 192, 384, 1248]
# entropy is a tensor of shape [1, 384, 1248]
>>> data = torch.rand([1, 192, 384, 1248], device='cuda')
>>> torch.cuda.memory_allocated() / 1024 / 1024 # convert to MB
352
>>> entropy = torch.rand([1, 384, 1248], device='cuda')
>>> torch.cuda.memory_allocated() / 1024 / 1024
1.828125
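
For reference, these sizes can be cross-checked by hand. This is only a minimal sketch, assuming float32 (4 bytes per element, the default dtype of torch.rand); tensor_mb is a hypothetical helper written for illustration, not a PyTorch function:

```python
import math

def tensor_mb(shape, bytes_per_element=4):
    """Approximate storage of a dense tensor in MB (assumes float32 by default)."""
    return math.prod(shape) * bytes_per_element / 1024 / 1024

# Shapes taken from the post above.
intermediate_mb = tensor_mb((1, 192, 384, 1248))
entropy_mb = tensor_mb((1, 384, 1248))
print(intermediate_mb, entropy_mb)
```

This gives 351.0 MB for the intermediate tensor and 1.828125 MB for entropy. The second figure matches the output above; the first is just under the reported 352, and the difference may come from another small allocation in that session (an assumption on my part).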

Again, I really appreciate your help! Your answer inspired me. torch.no_grad() actually works: the intermediate tensor is no longer on the GPU. However, the reported memory usage still increases a lot. I found that part of that memory is cached.
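
One way to tell memory held by live tensors apart from cached memory is to compare torch.cuda.memory_allocated() with torch.cuda.memory_reserved(). Below is a rough sketch, not a definitive diagnosis. It assumes a much smaller stand-in tensor than the [1, 192, 384, 1248] one in this thread, and entropy_no_grad and to_mb are hypothetical helper names introduced here. The CUDA calls only run if a GPU is available:

```python
import torch

def entropy_no_grad(probs, eps=1e-9):
    """Entropy over dim=1, computed without recording an autograd graph."""
    with torch.no_grad():
        return torch.sum(-(probs + eps).log() * probs, dim=1)

def to_mb(num_bytes):
    return num_bytes / 1024 / 1024

device = "cuda" if torch.cuda.is_available() else "cpu"

# Small stand-in for pred1_softmax (assumption: the real one comes from a network).
logits = torch.randn(1, 8, 16, 32, device=device, requires_grad=True)
pred1_softmax = torch.softmax(logits, dim=1)
entropy = entropy_no_grad(pred1_softmax)

if device == "cuda":
    # Memory currently occupied by live tensors.
    print("allocated MB:", to_mb(torch.cuda.memory_allocated()))
    # Memory held by PyTorch's caching allocator, including freed blocks kept for reuse.
    print("reserved MB:", to_mb(torch.cuda.memory_reserved()))
    # Return unused cached blocks to the driver; live tensors are not affected.
    torch.cuda.empty_cache()
    print("reserved MB after empty_cache:", to_mb(torch.cuda.memory_reserved()))
```

If reserved stays well above allocated after the entropy line runs, the extra usage is likely cached blocks from the freed intermediate tensor rather than tensors that are still alive. Tools such as nvidia-smi show reserved memory plus CUDA context overhead, so their numbers will be larger than memory_allocated().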