These values are not needed as such; only their product with dL/dA is. So these values are not always computed explicitly: depending on the function f, the product is sometimes computed without ever creating this matrix.
These values are not needed for reverse-mode AD, which is what PyTorch uses. It instead computes values like dloss/dA2 as intermediates, which you can access with a tensor hook on A2.
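As a minimal sketch of the point above (assuming PyTorch is installed; the names `W`, `A2`, and `save_grad` are illustrative), a tensor hook registered on an intermediate tensor receives the gradient of the loss with respect to that tensor during the backward pass, without any dA/dW matrix ever being materialized:

```python
import torch

# Leaf parameter and an intermediate tensor A2 (hypothetical names).
W = torch.ones(3, requires_grad=True)
A2 = W * 2  # intermediate result

captured = {}

def save_grad(g):
    # g is dloss/dA2, handed to us during backward().
    captured["dloss_dA2"] = g

A2.register_hook(save_grad)

loss = A2.sum()
loss.backward()

print(captured["dloss_dA2"])  # dloss/dA2: a vector of ones
print(W.grad)                 # dloss/dW = 2 * ones, via the chain rule
```

Note that the hook sees dloss/dA2 (the gradient flowing back), not the local Jacobian dA2/dW.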
Thanks for your reply. Please correct me if I am wrong, but as far as I know, local gradients are necessary to compute the gradient used for performing the updates.
As @Yaroslav_Bulatov mentioned, the backward function computes, given the incoming gradient dl/dA, the gradient dl/dW. The chain rule tells you that dl/dW = dl/dA * dA/dW, but sometimes dA/dW has a structure that makes it unnecessary to compute explicitly. Take the sum operation: if A = W.sum() (assuming W is 1D), then dA/dW = 1 (a 1D vector full of ones). In that case, dl/dW can be computed by simply expanding dl/dA to the size of W; dA/dW is never needed as a matrix to compute the gradients.
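The sum example can be sketched in a few lines of NumPy (the names `W` and `dl_dA` are illustrative): the explicit path materializes the vector of ones, while the efficient path just broadcasts the incoming scalar gradient to W's shape, and both give the same result.

```python
import numpy as np

# For A = W.sum(), the local Jacobian dA/dW is a vector of ones.
W = np.array([1.0, 2.0, 3.0])
dl_dA = 5.0  # incoming gradient for the scalar output A

# Explicit (wasteful) version: materialize dA/dW and multiply.
dA_dW = np.ones_like(W)
dl_dW_explicit = dl_dA * dA_dW

# Efficient version: expand dl/dA to W's shape, never building dA/dW.
dl_dW_expand = np.full_like(W, dl_dA)

assert np.allclose(dl_dW_explicit, dl_dW_expand)
print(dl_dW_expand)  # [5. 5. 5.]
```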
But what about the cases where dA/dW does need to be computed explicitly?
In these cases it is computed as a temporary variable and deleted before exiting the function to reduce memory consumption.
I see. Is there any way to access that temporary variable?
You can call .retain_grad() on it before the backward call. That way, its .grad field will be populated.
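A minimal sketch of that suggestion (assuming PyTorch; `W` and `A` are illustrative names): calling .retain_grad() on an intermediate tensor makes autograd keep the gradient of the loss with respect to that tensor in its .grad field.

```python
import torch

W = torch.tensor([1.0, 2.0], requires_grad=True)
A = W * 3          # intermediate, non-leaf tensor
A.retain_grad()    # ask autograd to keep dloss/dA in A.grad

loss = A.sum()
loss.backward()

print(A.grad)  # dloss/dA: a vector of ones
print(W.grad)  # dloss/dW = 3 * ones
```

Note that what gets stored is dloss/dA, the gradient of the loss with respect to A, not the local Jacobian dA/dW.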
It looks like .retain_grad() retains dl/dW rather than dA/dW?