Since I call the model within the no_grad context, I expected that no intermediate nodes would be created. But the resulting memory usage is not affected by the context. I wonder: is this a feature or a bug in PyTorch? If it is a feature, how will this cached memory be used in the future?
I am a bit confused here: do you expect the memory usage to go down to what is needed for the bare PyTorch/CUDA context? The variables and model layers are still in scope, so they are expected to use memory.
Sorry for the confusion. What I expect is that after I call model.forward under no_grad, only the outputs, i.e. o and h, occupy memory, and all the local variables inside the forward call are automatically freed. But as the example shows, I need to manually call torch.cuda.empty_cache() to free them.
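A minimal sketch of what I mean (the model and tensor sizes here are made up for illustration): under no_grad the outputs carry no autograd history, but on CUDA the memory used by the intermediates is only returned to the caching allocator, not to the driver, until empty_cache() is called.

```python
import torch
import torch.nn as nn

# Toy RNN and input, just to have a forward pass to inspect.
model = nn.RNN(input_size=8, hidden_size=16, batch_first=True)
x = torch.randn(4, 10, 8)

with torch.no_grad():
    o, h = model(x)

# No autograd graph is recorded, so the outputs have no history.
print(o.requires_grad, o.grad_fn)  # False None

# On a CUDA device, memory_allocated() drops after the forward pass
# (the intermediates were freed into the cache), but memory_reserved()
# stays high until the cache is explicitly released to the driver.
if torch.cuda.is_available():
    print(torch.cuda.memory_allocated(), torch.cuda.memory_reserved())
    torch.cuda.empty_cache()
    print(torch.cuda.memory_reserved())
```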
Thanks for the clarification! I think this behavior suggests that even with no_grad, the extra memory is allocated first and then deallocated into the caching allocator, rather than being skipped entirely. I wonder if this is true for other layers as well…
As @eqy said, intermediate tensors must be computed at some point even if they are released afterwards. Otherwise you wouldn't be able to pass any forward activations to the next operation. Since you are observing the usage of intermediates in RNNs, my guess is that the intermediate forward activations are computed before being passed to the next time step and released afterwards. This would also explain why the memory shows up in the cache (not as allocated).
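You can see this with a hand-unrolled RNN step (the weights and sizes below are hypothetical, just to illustrate the point): each h_t is a real tensor that must live in memory while it is the input to the next step, even under no_grad; once nothing references it, its storage is freed (on CUDA, back into the caching allocator).

```python
import torch

# Hypothetical single-layer RNN cell, unrolled by hand.
W_ih = torch.randn(16, 8)   # input-to-hidden weights
W_hh = torch.randn(16, 16)  # hidden-to-hidden weights
x = torch.randn(10, 8)      # 10 time steps of input
h = torch.zeros(16)         # initial hidden state

with torch.no_grad():
    for t in range(x.size(0)):
        # The new h must be materialized before the old one can be
        # dropped; rebinding `h` releases the previous step's tensor
        # (into the caching allocator on a CUDA device).
        h = torch.tanh(W_ih @ x[t] + W_hh @ h)

print(h.shape)  # torch.Size([16])
```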