GRU model does not free the cached memory automatically

IcarusWizard · September 3, 2021, 7:22am

Hi, everyone. I find that GRU model does not free the cached memory automatically. Here is the code sample I use.

import torch
data = torch.rand(998, 400, 23).cuda() 
# 1183M memory usage
model = torch.nn.GRU(23, 128).cuda()
# 1247M memory usage
with torch.no_grad():
    h, o = model(data)
# 2811M memory usage
torch.cuda.empty_cache()
# 1443M memory usage

import torch
data = torch.rand(998, 400, 23).cuda() 
# 1183M memory usage
model = torch.nn.GRU(23, 128).cuda()
# 1247M memory usage
h, o = model(data)
# 2811M memory usage
torch.cuda.empty_cache()
# 2613M memory usage

Since I call the mode within the no_grad context, I expect no intermedium nodes will be created. But the result memory usage is no effect by the context. I wonder it is a feature or a bug for PyTorch? If it is a feature, how these cached memory will be used in the future?

eqy · September 4, 2021, 11:23pm

I am a bit confused here, do you expect the memory usage to go down to what is needed for the bare PyTorch/CUDA context? The variables and model layers are still in scope so they are expected to use memory.

IcarusWizard · September 5, 2021, 4:03am

Sorry to cause the confusion. What I expect is that after I call model.forward with no_grad, only the outputs, i.e. o and h, occupy the memory and all the local variables inside the forward call is automatically freed. But as the example shows, I need to manually call torch.cuda.empty_cache() to free them.

eqy · September 5, 2021, 9:33pm

Thanks for the clarification! I think this behavior suggests that even with no_grad, extra memory is allocated first and the deallocated in the caching allocator rather than being skipped. I wonder if this is true for other layers as well…

IcarusWizard · September 6, 2021, 5:11am

I have tried GRU, LSTM, Linear and Conv2d, only the recurrent models have this issue.

ptrblck · September 6, 2021, 9:08am

As @eqy said, intermediate tensors must be computed at one point even if they are released afterwards. Otherwise you wouldn’t be able to pass any forward activations to the next operation. Since you are observing the usage of intermediates in RNNs, my guess is that the intermediate forward activations are computed before being passed to the next time step and released afterwards. This would also explain, why the memory is in the cache (not allocated).