Accessing Local Gradient information

sawal86 · October 23, 2019, 2:23pm

albanD · October 23, 2019, 4:13pm

Hi,

These values are not needed as such; only their product with dL/dA is needed. As a result, they are not always computed. It depends on the function f, and sometimes the product is computed without ever creating this matrix explicitly.

Yaroslav_Bulatov · October 23, 2019, 5:15pm

These values are not needed for reverse-mode AD, which is what PyTorch uses. Instead, it computes values like dloss/dA2 as intermediates, which you can access using a tensor hook on A2.
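
As a rough illustration of the hook approach, here is a minimal sketch. The tensors `W` and `x`, the tanh layer, the squared-sum loss, and the `captured` dictionary are all made up for the example; only `register_hook` is the actual mechanism being shown.

```python
import torch

torch.manual_seed(0)

# Hypothetical setup: a single layer producing an intermediate tensor A2.
W = torch.randn(3, 4, requires_grad=True)
x = torch.randn(4)

captured = {}

def save_grad(grad):
    # Called during backward with dloss/dA2; returning None leaves it unchanged.
    captured["dloss_dA2"] = grad

A2 = torch.tanh(W @ x)        # intermediate (non-leaf) tensor
A2.register_hook(save_grad)   # hook fires when the gradient w.r.t. A2 is computed

loss = (A2 ** 2).sum()        # assumed loss, so dloss/dA2 = 2 * A2
loss.backward()

print(captured["dloss_dA2"])
```

With this particular loss, the captured tensor should match `2 * A2`, which is an easy sanity check that the hook received dloss/dA2 and not a local Jacobian.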

sawal86 · October 23, 2019, 8:06pm

Thank you for your reply. Please correct me if I am wrong. Based on my understanding of backprop, local gradients are needed to compute the gradient that is used for performing the updates. For example

sawal86 · October 23, 2019, 8:08pm

Thanks for your reply. However, as far as I know, local gradients are necessary to compute the gradient needed for performing the updates.

albanD · October 23, 2019, 8:24pm

As @Yaroslav_Bulatov mentioned, the backward function computes dl/dW given dl/dA.
The formula tells you that dl/dW = dl/dA * dA/dW, but sometimes dA/dW has a structure that makes it unnecessary to actually compute it.

For the sum operation, for example, if A = W.sum() (assuming W is 1D), then dA/dW = 1 (a 1D vector full of ones). In that case, dl/dW can be computed by simply expanding dl/dA to the size of W, and dA/dW is never needed as a matrix to compute gradients.
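
The sum case can be checked numerically. This is only a sketch: the values of `W` and the downstream loss `5.0 * A` are arbitrary choices made so that dl/dA is a known constant.

```python
import torch

W = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
A = W.sum()

# Hypothetical downstream loss chosen so that dl/dA = 5.
loss = 5.0 * A
loss.backward()

# Shortcut: expand the scalar dl/dA to the shape of W, no Jacobian involved.
dl_dA = torch.tensor(5.0)
expanded = dl_dA.expand_as(W)

# Explicit route, for comparison only: multiply by dA/dW, a vector of ones.
dA_dW = torch.ones_like(W)
explicit = dl_dA * dA_dW

print(W.grad)      # gradient computed by autograd
print(expanded)    # same values, obtained by expansion alone
```

Both routes give the same dl/dW, which is why the vector of ones never has to be materialized.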

sawal86 · October 23, 2019, 8:52pm

But what about the cases where dA/dW needs to be computed explicitly?

albanD · October 23, 2019, 9:09pm

In these cases it is computed as a temporary variable and deleted before exiting the function to reduce memory consumption.

sawal86 · October 23, 2019, 10:03pm

I see. Is there any way to access that temporary variable?

albanD · October 24, 2019, 2:18pm

You can call .retain_grad() on it before the backward call. That way, its .grad field will be populated.

llseek · September 6, 2022, 9:08am

It looks like .retain_grad() retains dl/dW instead of dA/dW?
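
A small experiment that may help tell the two quantities apart. It assumes a PyTorch version that provides `torch.autograd.functional.jacobian`; the function `f` and the loss are made up for the example. Here `.retain_grad()` on the intermediate `A` stores the gradient of the loss with respect to `A`, while the local Jacobian dA/dW has to be requested separately.

```python
import torch

def f(W):
    # Hypothetical local function A = f(W); its Jacobian is diag(2 * W).
    return W ** 2

W = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
A = f(W)
A.retain_grad()            # keep .grad on the non-leaf tensor A

loss = (3.0 * A).sum()     # assumed loss, so dl/dA = 3 for every element
loss.backward()

print(A.grad)              # dl/dA, a gradient of the loss, not dA/dW
print(W.grad)              # dl/dW = dl/dA * dA/dW

# The local Jacobian dA/dW, computed explicitly as a 3x3 matrix.
dA_dW = torch.autograd.functional.jacobian(f, W.detach())
print(dA_dW)
```

Multiplying `A.grad` by `dA_dW` reproduces `W.grad`, which is the chain rule discussed earlier in the thread. Note that building the full Jacobian this way can be expensive for large tensors.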