After reading the documentation for `register_full_backward_hook` and `register_forward_pre_hook`, I was wondering if I could ask a few clarification questions about things that aren't clear from the docs?
From my understanding, `register_full_backward_hook` provides the `grad_input` and `grad_output` variables, which are the gradients of the loss with respect to the layer's input and output respectively, while `register_forward_pre_hook` provides the input of a given layer.
For example, I use an `nn.Linear` module like `Linear(in_features=2, out_features=32, bias=True)`, where the input data has shape `[B, A, 2]` (`B` is the batch dimension and `A` is another dimension of the input). When applying a `forward_pre_hook`, the input I receive has the same shape as my input, which makes sense (it's `[B, A, 2]`). When applying a `full_backward_hook`, the `grad_output` shape is `[B, A, 32]`, which also makes sense (it matches the layer's output shape). Is it right to assume that this tensor contains the individual gradients of the loss with respect to the layer output for all inputs? I.e., does each element represent the gradient of the loss for a given input before any kind of reduction operation like a sum or mean over the batch?
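For reference, here's a minimal sketch of the setup I'm describing (toy sizes; the hook functions are my own):

```python
import torch
import torch.nn as nn

B, A = 4, 5
linear = nn.Linear(in_features=2, out_features=32, bias=True)

def pre_hook(module, inputs):
    # inputs is a tuple of the positional args passed to forward
    print("pre-hook input shape:", inputs[0].shape)    # [B, A, 2]

def backward_hook(module, grad_input, grad_output):
    # grad_input/grad_output are tuples of gradients w.r.t. inputs/outputs
    print("grad_input shape: ", grad_input[0].shape)   # [B, A, 2]
    print("grad_output shape:", grad_output[0].shape)  # [B, A, 32]

linear.register_forward_pre_hook(pre_hook)
linear.register_full_backward_hook(backward_hook)

linear(torch.randn(B, A, 2)).sum().backward()
```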
Also, is it possible to reconstruct the gradient of the layer's weights from its `input` and `grad_output` variables? For example, with an `nn.Linear` layer, the gradient of the loss with respect to the weights can be decomposed into the gradient of the loss with respect to the output of the layer (aka `grad_output`) and the gradient of the layer's output with respect to the weights (which for a Linear layer is just the input features, as returned by the `forward_pre_hook`).
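Concretely, in my notation (assuming the standard `nn.Linear` convention $y = xW^\top + b$), what I expect to hold is:

$$\frac{\partial L}{\partial W} \;=\; \sum_{b=1}^{B} \sum_{a=1}^{A} \left(\frac{\partial L}{\partial y_{b,a}}\right)^{\!\top} x_{b,a},$$

where $x_{b,a} \in \mathbb{R}^{2}$ is one row of the input and $\partial L / \partial y_{b,a} \in \mathbb{R}^{32}$ is the corresponding row of `grad_output`, so each outer product has the same shape as `weight` ($32 \times 2$).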
I tried the following to reconstruct the weight gradient, but it isn't always the same as the `.grad` that PyTorch stores on the module:

```python
# grad_output has shape [B, A, 32] (from the full backward hook)
# input has shape [B, A, 2] (from the forward pre-hook)
grad_from_hooks = torch.einsum("bao,bai->oi", grad_output, input)
grad_from_module = linear.weight.grad  # shape [32, 2]
ratio = grad_from_hooks / grad_from_module  # should be all ones if equal
```
In my case this ratio is near 1 for nearly all elements of the tensor (in the range of 0.95 to 1.05), but some elements differ by a factor of 2 or 3. I was wondering why exactly this formula doesn't hold when reconstructing the gradients from the `grad_output` and `input` tensors? The expression I saw was that the outer product of `grad_output` and `input` should give the gradient, which is what I've done in the `einsum` above.
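In case it helps, here's a self-contained sketch of the full comparison I'm running (toy data and my own helper names, not my actual model):

```python
import torch
import torch.nn as nn

B, A = 4, 5
linear = nn.Linear(2, 32, bias=True)
captured = {}

def save_input(module, inputs):
    captured["input"] = inputs[0].detach()

def save_grad_output(module, grad_input, grad_output):
    captured["grad_output"] = grad_output[0].detach()

linear.register_forward_pre_hook(save_input)
linear.register_full_backward_hook(save_grad_output)

linear(torch.randn(B, A, 2)).sum().backward()

# outer product of grad_output and input, summed over the batch and A dims
grad_from_hooks = torch.einsum("bao,bai->oi", captured["grad_output"], captured["input"])
print(torch.allclose(grad_from_hooks, linear.weight.grad))
```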
Is it at all possible to reconstruct the gradient of the layer's weights by using the layer's `input` values from `register_forward_pre_hook` and its `grad_output` values from `register_full_backward_hook`?
Apologies for this being quite an in-depth question, but I'd appreciate any clarification on this issue!