Register_backward_hook on nn.Sequential

I tested register_backward_hook on nn.Sequential as below.

import torch
import torch.nn as nn
from torch.autograd import Variable

a = nn.Sequential(nn.Linear(5,3), nn.Tanh(), nn.Linear(3,2))

def hookFunc(module, gradInput, gradOutput):
    print(len(gradInput))
    for v in gradInput:
        print(v)

a.register_backward_hook(hookFunc)

input = Variable(torch.randn(4,5))
output = a(input)

target = torch.FloatTensor(4,2).fill_(1)
output.backward(target)

The output is as follows.
3
Variable containing:
-0.1122 0.1216 0.7935
-0.1122 0.1216 0.7935
-0.1122 0.1216 0.7935
-0.1122 0.1216 0.7935
[torch.FloatTensor of size 4x3]

Variable containing:
-0.5910 -0.7340 -0.4239
-0.5910 -0.7340 -0.4239
[torch.FloatTensor of size 2x3]

Variable containing:
 4
 4
[torch.FloatTensor of size 2]

So, it seems that when register_backward_hook is used on an nn.Sequential, only the gradient-related values for the last element of the nn.Sequential are returned.

I wonder whether this is the intended behavior. To get the gradient values for a specific element, should I register the hook on that element rather than on the nn.Sequential module?

Another question: I wonder why the description at http://pytorch.org/docs/_modules/torch/nn/modules/module.html#Module.register_backward_hook says that the hook shouldn’t modify its arguments (gradInput, for example). Is there a reason for this? If so, what would be a better way to manually modify the gradient inputs that are backward-passed to the previous module?

Yeah, it’s a known bug (there’s a GitHub issue for it), but the fix is on hold because of the large autograd refactor going on right now. Sorry about that.

Yes, you should never modify any of the arguments given to the hook in-place. If you want to replace the grad input, you can do out-of-place operations on it and return the new values from the hook.
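
For example, a minimal sketch of that out-of-place pattern on a single Linear layer (the 0.5 factor and the names are just for illustration):

import torch
import torch.nn as nn
from torch.autograd import Variable

lin = nn.Linear(5, 3)

def halve_grad_input(module, grad_input, grad_output):
    # Don't modify grad_input in-place; build new values out-of-place and
    # return them so autograd uses the returned tuple instead.
    return tuple(g * 0.5 if g is not None else None for g in grad_input)

lin.register_backward_hook(halve_grad_input)

x = Variable(torch.randn(4, 5), requires_grad=True)
out = lin(x)
out.backward(torch.FloatTensor(4, 3).fill_(1))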

@apaszke Could you elaborate on how to modify the gradients of a specific sub-module in the middle of the backward pass?

Like this:

x = Variable(torch.randn(5, 5), requires_grad=True)
y = x + 2
y.register_hook(lambda grad: grad * 2)
y.sum().backward()
x.grad # is now filled with 2

But remember the container hook problem mentioned above: do this only on primitive modules or Variables.
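
For the per-element question above, that means registering the hook directly on a primitive child of the Sequential; a rough sketch that reuses the Sequential a and the hookFunc from the first post:

# Hook a primitive child (here the first Linear) instead of the container.
first_linear = list(a.children())[0]
first_linear.register_backward_hook(hookFunc)

input = Variable(torch.randn(4, 5))
output = a(input)
output.backward(torch.FloatTensor(4, 2).fill_(1))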

@apaszke The example you showed seems to be for a Variable, not a Module. Is there any way I can do something similar on a Module?

What I actually want to do is modify the input gradients that are backward-passed to the previous modules so that I can do adversarial training.
In Torch7, for example, the GradientReversal module modifies the gradInput by multiplying it by -1 (not the gradients used for the module’s weight updates).
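
To make it concrete, here is a rough sketch of what I mean, using the Variable hook pattern from above (the two Linear layers are just placeholder stand-ins for my real networks):

import torch
import torch.nn as nn
from torch.autograd import Variable

feature_net = nn.Linear(5, 3)   # placeholder stand-in for the feature extractor
classifier = nn.Linear(3, 2)    # placeholder stand-in for the rest of the model

x = Variable(torch.randn(4, 5), requires_grad=True)
features = feature_net(x)
# Flip the gradient flowing back past this point, like Torch7's GradientReversal:
features.register_hook(lambda grad: grad * -1)
out = classifier(features)
out.backward(torch.FloatTensor(4, 2).fill_(1))
# feature_net's parameter gradients (and x.grad) are now computed from the flipped gradient.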

Thanks!

It works in the same way for modules:

module.register_backward_hook(lambda module, grad_i, grad_o: grad_i * -1)

@apaszke Got it. Thanks!

For a Linear module with parameters W and b, grad_i will be a tuple that includes the gradients over the input, W, and b. So should we instead use the following function?

module.register_backward_hook(lambda module, grad_i, grad_o: (grad_i[0] * -1, grad_i[1], grad_i[2]))

That approach worked in my case.
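
For completeness, here is a self-contained sketch of the version that worked for me, assuming (as above) that grad_i[0] is the gradient over the input:

import torch
import torch.nn as nn
from torch.autograd import Variable

lin = nn.Linear(3, 2)

def reverse_input_grad(module, grad_i, grad_o):
    # Flip only the entry assumed to hold the input gradient; return the whole
    # tuple so the remaining entries pass through unchanged.
    return (grad_i[0] * -1,) + tuple(grad_i[1:])

lin.register_backward_hook(reverse_input_grad)

x = Variable(torch.randn(4, 3), requires_grad=True)
out = lin(x)
out.backward(torch.FloatTensor(4, 2).fill_(1))
print(x.grad)  # should show the sign flip if grad_i[0] really is the input gradient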


Yes, it is.
In your case, grad_i has two elements, grad_i[0] and grad_i[1].

class DDReLU(nn.Module):
    def __init__(self):
        super(DDReLU, self).__init__()
        self.threshold = nn.Parameter(torch.rand(1), requires_grad=True)
        # Intended to scale the gradient reaching the threshold (grad_i[1]) by 0.01:
        self.register_backward_hook(lambda module, grad_i, grad_o: (grad_i[0], grad_i[1]*0.01))
        #self.threshold.data.fill_(0.1)
        self.ReLU = nn.ReLU(True)

    def forward(self, x):
        print(self.threshold.data[0])
        # ReLU shifted by the learnable threshold (output equals x wherever x > -threshold):
        return self.ReLU(x + self.threshold) - self.threshold
        #return self.ReLU(x) + self.threshold

Is the code above a reasonable way to change the relative learning rate of the new parameter?

By relative learning rate, I mean that the new parameter has a learning rate that is 0.01 times the one used for the model’s other parameters.
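
For comparison, I understand the same 0.01x relative learning rate could also be obtained without the hook by giving the threshold its own optimizer param group; a rough sketch with a placeholder wrapper model and learning rate:

import torch
import torch.nn as nn
import torch.optim as optim

# Placeholder model that just wraps the DDReLU defined above.
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.fc = nn.Linear(5, 3)
        self.ddrelu = DDReLU()

    def forward(self, x):
        return self.ddrelu(self.fc(x))

net = Net()
base_lr = 0.1  # placeholder value
threshold = [net.ddrelu.threshold]
others = [p for p in net.parameters() if p is not net.ddrelu.threshold]

# Give the threshold 0.01x the base learning rate via its own param group,
# instead of rescaling its gradient in a backward hook.
optimizer = optim.SGD([
    {'params': others},
    {'params': threshold, 'lr': base_lr * 0.01},
], lr=base_lr)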