I wrote a C++/CUDA extension that produces a gradient tensor, e.g. dLdX, where L is the loss and X is a tensor. When I call X.backward(dLdX), the memory dLdX occupies (allocated in C++) does not get recycled. My question is: how can I tell the computational graph to recycle dLdX once it is no longer needed during the execution of X.backward(dLdX)?
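Roughly, the pattern looks like this (the extension call is replaced by a stand-in so the snippet runs on its own):

```python
import torch

# Some upstream parameter and an intermediate tensor X in the graph.
W = torch.randn(1024, 1024, device="cuda", requires_grad=True)
X = W * 2  # intermediate tensor

# In the real code, dLdX is produced by the C++/CUDA extension;
# torch.ones_like is only a stand-in here so the example is self-contained.
dLdX = torch.ones_like(X)

# Backprop from X using the externally computed gradient.
X.backward(dLdX)
print(W.grad.norm())
```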
How did you check that the memory is not recycled? Was torch.cuda.memory_allocated() higher after the operation, and did it stay that way?
If so, are you storing any references to it in your extension?
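I.e., roughly a check like this (with a stand-in for the tensor your extension allocates):

```python
import torch

W = torch.randn(1024, 1024, device="cuda", requires_grad=True)
X = W * 2                   # intermediate tensor, as in your setup

before = torch.cuda.memory_allocated()
dLdX = torch.ones_like(X)   # stand-in for the tensor your extension allocates
X.backward(dLdX)
del dLdX                    # drop the Python-side reference
torch.cuda.synchronize()
after = torch.cuda.memory_allocated()
# W.grad was allocated during backward, so some growth is expected;
# anything beyond that would suggest dLdX is still being held somewhere.
print(before, after)
```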
If I check torch.cuda.memory_allocated() after X.backward(dLdX), dLdX's GPU memory has indeed been released. But I am wondering whether there is a way to release it in the middle of the backward pass. The analogy would be X being an intermediate variable whose gradient gets released as soon as it is no longer needed for the rest of the backprop. Am I making sense?
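To make the question concrete, here is roughly how I would watch memory during the backward pass, by reading torch.cuda.memory_allocated() from inside tensor hooks (names and sizes are just for illustration, and the dLdX stand-in replaces the extension output):

```python
import torch

W = torch.randn(2048, 2048, device="cuda", requires_grad=True)
A = W.relu()   # an intermediate tensor upstream of X
X = A * 3      # the tensor I call backward() on

def report(name):
    def hook(grad):
        # Fires during backward, when the gradient w.r.t. this tensor is computed.
        print(f"{name}: {torch.cuda.memory_allocated() / 1e6:.1f} MB allocated")
    return hook

A.register_hook(report("A"))
X.register_hook(report("X"))

dLdX = torch.ones_like(X)  # stand-in for the gradient tensor computed in C++/CUDA
X.backward(dLdX)
```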
I made another post after I realized I was not asking the right question – this behavior is not specific to tensors allocated in C++.