After reading the documentation for `register_full_backward_hook` and `register_forward_pre_hook`, I was wondering if I could ask a few clarification questions about things that aren't clear from the docs?
From my understanding, `register_full_backward_hook` provides the `grad_input` and `grad_output` variables, which are the gradients of the loss with respect to the layer's input and output respectively, while `register_forward_pre_hook` provides the input of a given layer.
For example, I use an `nn.Linear` module like `Linear(in_features=2, out_features=32, bias=True)`, where the input data has shape `[B, A, 2]` (`B` is the batch dimension and `A` is another dimension of the input). When applying a `forward_pre_hook`, the input I receive has the same shape as my input, which makes sense (it's `[B, A, 2]`). When applying a `full_backward_hook`, the `grad_output` shape is `[B, A, 32]`, which also makes sense (it matches the layer's output shape). Is it right to assume that this tensor contains the individual gradients of the loss with respect to the layer output for all inputs? I.e., does each element represent the gradient of the loss for a given input before any kind of reduction operation like a sum or mean over the batch?
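For reference, here's a minimal sketch of the setup I'm describing (toy sizes; the hook functions are my own):

```python
import torch
import torch.nn as nn

B, A = 4, 5
linear = nn.Linear(in_features=2, out_features=32, bias=True)

def pre_hook(module, inputs):
    # inputs is a tuple of the positional args passed to forward
    print("pre-hook input shape:", inputs[0].shape)    # [B, A, 2]

def backward_hook(module, grad_input, grad_output):
    # grad_input/grad_output are tuples of gradients w.r.t. inputs/outputs
    print("grad_input shape: ", grad_input[0].shape)   # [B, A, 2]
    print("grad_output shape:", grad_output[0].shape)  # [B, A, 32]

linear.register_forward_pre_hook(pre_hook)
linear.register_full_backward_hook(backward_hook)

linear(torch.randn(B, A, 2)).sum().backward()
```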
Also, is it possible to reconstruct the gradient of the layer's weights from its `input` and `grad_output` variables? For example, with an `nn.Linear` layer, the gradient of the loss with respect to the weights can be decomposed into the gradient of the loss with respect to the output of the layer (aka `grad_output`) and the gradient of the layer's output with respect to the weights (which for a Linear layer is just the input features, as returned by the `forward_pre_hook`).
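Concretely, in my notation (assuming the standard `nn.Linear` convention $y = xW^\top + b$), what I expect to hold is:

$$\frac{\partial L}{\partial W} \;=\; \sum_{b=1}^{B} \sum_{a=1}^{A} \left(\frac{\partial L}{\partial y_{b,a}}\right)^{\!\top} x_{b,a},$$

where $x_{b,a} \in \mathbb{R}^{2}$ is one row of the input and $\partial L / \partial y_{b,a} \in \mathbb{R}^{32}$ is the corresponding row of `grad_output`, so each outer product has the same shape as `weight` ($32 \times 2$).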
I tried the following to reconstruct the weight gradient, but it isn't always the same as the `.grad` that PyTorch stores on the module:

```python
# grad_output has shape [B, A, 32] (from the full backward hook)
# input has shape [B, A, 2] (from the forward pre-hook)
grad_from_hooks = torch.einsum("bao,bai->oi", grad_output, input)
grad_from_module = linear.weight.grad  # shape [32, 2]
ratio = grad_from_hooks / grad_from_module  # should be all ones if equal
```
In my case this ratio is near 1 for nearly all elements of the tensor (in the range of 0.95 to 1.05), but some elements differ by a factor of 2 or 3. I was wondering why exactly this formula doesn't hold when reconstructing the gradients from the `grad_output` and `input` tensors? The expression I saw was that the outer product of `grad_output` and `input` should give the gradient, which is what I've done in the `einsum` above.
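In case it helps, here's a self-contained sketch of the full comparison I'm running (toy data and my own helper names, not my actual model):

```python
import torch
import torch.nn as nn

B, A = 4, 5
linear = nn.Linear(2, 32, bias=True)
captured = {}

def save_input(module, inputs):
    captured["input"] = inputs[0].detach()

def save_grad_output(module, grad_input, grad_output):
    captured["grad_output"] = grad_output[0].detach()

linear.register_forward_pre_hook(save_input)
linear.register_full_backward_hook(save_grad_output)

linear(torch.randn(B, A, 2)).sum().backward()

# outer product of grad_output and input, summed over the batch and A dims
grad_from_hooks = torch.einsum("bao,bai->oi", captured["grad_output"], captured["input"])
print(torch.allclose(grad_from_hooks, linear.weight.grad))
```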
Is it at all possible to reconstruct the gradient of the layer's weights by using the layer's `input` values from `register_forward_pre_hook` and its `grad_output` values from `register_full_backward_hook`?
Apologies for this being quite an in-depth question, but I'd appreciate any clarification on this issue!