Understanding full backward hooks in nn.Module

Hello everyone. I’m trying to implement a gradient estimator (like the straight-through estimator [Hinton 2012]) on a simple convnet.
For this I have decided to use backward hooks on the individual Linear layers of the convnet.
The snippet that does this looks like this:

def clipped_relu_hook(m,i,o):
    estimated_gradient = torch.matmul(o[0],m.weight)
    estimated_gradient = torch.clip(torch.relu(estimated_gradient),1)
    estimated_gradient = torch.unsqueeze(estimated_gradient, dim=0)
    return estimated_gradient

def register_backward_hook_for_(Model):
    target_modules = []
    for m in Model.modules():
        if isinstance(m,nn.Linear):
            target_modules.append(m)
    # target_modules.pop()
    for modules in target_modules:
        modules.register_full_backward_hook(clipped_relu_hook)

This, however, did not work. Then I tried returning a tensor of ones from the backward hook of the linear layers and printing the gradients before performing the optimizer step. The hook for that is:

def clipped_relu_hook(m,i,o):
    estimated_gradient = torch.ones_like(i[0])
    return estimated_gradient

I expected the code to print tensors with the dimensions of the linear layers and all values equal to 1, but it did not.

Do I not understand the way these hooks work? I would appreciate it if someone explained their working mechanism. I would also really appreciate suggestions for better ways to implement gradient estimators like the STE (straight-through estimator).
Thank you!

Hi,

These hooks give you the gradient wrt each input and each output of the module’s forward function.
You can optionally return new values to be used in place of the given input gradients.

Note that things like Parameters are not inputs and thus are not considered here.
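To make this concrete, here is a minimal sketch that registers a full backward hook on a single nn.Linear and records what the hook receives. The layer sizes, the batch size, and the `captured` dict are made up for illustration. It shows that grad_input has one entry (for the layer's input tensor) and that the weight gradient is not part of it.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

layer = nn.Linear(3, 2)
captured = {}  # hypothetical container, used only for this illustration

def inspect_hook(module, grad_input, grad_output):
    # Both arguments are tuples of tensors.
    captured["grad_input"] = grad_input    # gradients wrt the layer's inputs
    captured["grad_output"] = grad_output  # gradients wrt the layer's outputs
    # Returning None leaves the gradients unchanged.

handle = layer.register_full_backward_hook(inspect_hook)

x = torch.randn(4, 3, requires_grad=True)  # batch of 4, 3 features
layer(x).sum().backward()

# grad_input matches the shape of x, grad_output the shape of layer(x).
print(len(captured["grad_input"]), captured["grad_input"][0].shape)
print(len(captured["grad_output"]), captured["grad_output"][0].shape)

# The weight gradient lives on the parameter, not in grad_input.
print(layer.weight.grad.shape)

handle.remove()
```

Note that x is created with requires_grad=True here; if the layer's input does not require a gradient, there is no input gradient for the hook to report.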

Thank you for the reply!

I still have a few doubts, which I’d like to mention. The backward hook’s signature looks like this: hook(module, grad_input, grad_output) -> Tensor or None

Is the gradient input the gradient received by the current layer (the one on which we are applying the hook)? Or is it the value of the gradient which is received by the next layer in the backprop order of our model?
One last thing and I’ll stop bothering you :> Is there a way to control the value back-propagating from one layer to its previous one? I’m trying to implement the straight-through estimator (for gradient estimation in networks with quantized weights), but I haven’t managed to do it with hooks (yet).

Thank you again for giving your time!

Is the gradient input the gradient received by the current layer (the one on which we are applying the hook)? Or is it the value of the gradient which is received by the next layer in the backprop order of our model?

The grad_input contains the gradients wrt the input of the layer. Similarly, the grad_output contains the gradients wrt the output of the layer.
Since backprop works in reverse, grad_output is what got propagated from the next layer while grad_input is what will be sent to the previous one.
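A small sketch of that direction, assuming two Linear layers stacked directly with no activation in between (the sizes and the `seen` dict are made up for illustration): the grad_input reported for the second layer should be the same tensor values as the grad_output reported for the first.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

net = nn.Sequential(nn.Linear(3, 4), nn.Linear(4, 2))
seen = {}  # hypothetical dict, keyed by a label chosen for this example

def make_hook(label):
    def hook(module, grad_input, grad_output):
        # Store what this layer's hook received during backward.
        seen[label] = (grad_input, grad_output)
    return hook

net[0].register_full_backward_hook(make_hook("first"))
net[1].register_full_backward_hook(make_hook("second"))

x = torch.randn(5, 3, requires_grad=True)
net(x).sum().backward()

first_in, first_out = seen["first"]
second_in, second_out = seen["second"]

# What the second layer sends backward is what the first layer receives.
print(torch.equal(second_in[0], first_out[0]))
# What the first layer sends backward ends up as the gradient of x.
print(torch.equal(first_in[0], x.grad))
```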

As mentioned above, you can return a new value for grad_input that will then be used.
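As a rough sketch of returning a replacement (the sizes and the name `ones_hook` are made up for illustration), the hook below returns a tuple with one tensor per entry of grad_input. The replacement is what flows back to the layer's input; the layer's own weight gradient is still computed from grad_output, which may be why printing the parameter gradients does not show the ones.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

layer = nn.Linear(3, 2)

def ones_hook(module, grad_input, grad_output):
    # Return a tuple with the same number of entries as grad_input.
    return (torch.ones_like(grad_input[0]),)

layer.register_full_backward_hook(ones_hook)

x = torch.randn(4, 3, requires_grad=True)
layer(x).sum().backward()

# The gradient reaching x is the replacement returned by the hook.
print(x.grad)

# The weight gradient is unaffected: for y = x @ W.T + b and a loss of
# y.sum(), grad_output is all ones, so dL/dW = ones(4, 2).T @ x.
expected_weight_grad = torch.ones(4, 2).t() @ x.detach()
print(torch.allclose(layer.weight.grad, expected_weight_grad))
```

So a hook like this changes what earlier layers see during backprop, not the gradient of the hooked layer's own parameters.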

Now I understand it completely. Thank you, you were really helpful!