What do the grad_in and grad_out of nn.Conv2d consist of?

I want to change the grad of nn.Conv2d in its backward, here is my code:

import torch
import torch.nn as nn

def fun(module, grad_in, grad_out):
    print(len(grad_in), len(grad_out))     # add break point here

class testNet(nn.Module):
    def __init__(self):
        super(testNet, self).__init__()
        # channel sizes here are just an example; the input is 2 x 1 x 28 x 28
        self.l1 = nn.Conv2d(1, 4, 3, padding=1)
        self.l2 = nn.Conv2d(4, 4, 3, padding=1)
        self.l3 = nn.Linear(4 * 28 * 28, 10)
        self.l1.register_backward_hook(fun)
        # self.l2.register_backward_hook(fun)

    def forward(self, input):
        x = self.l2(self.l1(input))
        return self.l3(x.view(2, -1))

if __name__ == '__main__':
    net = testNet()
    net(torch.rand(2, 1, 28, 28)).sum().backward()   # input is 2 1 28 28

I used the PyCharm debugger to figure out the shapes of grad_in and grad_out of self.l1, and I found that grad_in is a tuple of size 3, while grad_out is a tuple of size 1.

And if I uncomment self.l2.register_backward_hook(fun), I meet the gradients of self.l2 first in the debugger, since the backward pass visits the layers in reverse order.

The questions are:
1. What is grad_in[2] of nn.Conv2d used for?
2. I tried to return grad_in2, grad_out2 from fun, but an error was raised saying fun should return 3 objects, not 2. Why 3 objects?
3. What should I do if I need to store the history of gradients, which will be used as information to build the new gradients?
4. What I want to achieve is assigning new gradients to the kernels of conv2d according to their gradient history and the current gradient. I guess grad_in[2] may be a factor that affects the updating speed of different kernels. Is it enough if I only change this gradient to achieve my goal?

Any help will be appreciated!

Is my understanding right? grad_in is in the order (input, weight, bias), which are the inputs of torch.nn.Conv2d, and I can only return one tuple consisting of three Tensors; the error message misled me. A good way to store the history is to store it in the Module, which may require me to write a new conv2d module. And since changing the gradient of the bias is meaningless if I want to affect the gradients of the kernels' weights, I'd better change grad_in[1] rather than grad_in[2].
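If that is right, a hook that modifies the gradients would have to return a tuple of the same length as grad_in. A minimal sketch (the 0.5 scale factor and the layer sizes are made up; register_backward_hook is the old-style hook discussed in this thread, deprecated in newer PyTorch in favor of register_full_backward_hook):

```python
import torch
import torch.nn as nn

captured = {}

def scale_hook(module, grad_in, grad_out):
    # With the old-style hook, grad_in for Conv2d is (grad_input, grad_weight, grad_bias)
    captured['n'] = len(grad_in)
    # Return a tuple of the SAME length; leave None entries (e.g. the input grad) alone
    return tuple(None if g is None else g * 0.5 for g in grad_in)

conv = nn.Conv2d(1, 4, 3)             # layer sizes are made up
conv.register_backward_hook(scale_hook)
conv(torch.rand(2, 1, 28, 28)).sum().backward()
# conv.weight.grad and conv.bias.grad are now half of what they would normally be
```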


  1. grad_in[2] is the gradients for the bias as you saw.
  2. The conv takes 3 inputs: (input, weights, bias). Since for the first layer, you don’t need gradients for the input, None is returned. For the second layer, gradients for the input are needed to compute the gradients of the first layer and so are computed.
  3. Not sure what you mean here — what do you want to save exactly? If you want the gradients for the weights or bias, you can recover them after the backward pass through the .grad attribute on them. If you want the gradient for the input, you can use inp.register_hook(your_saving_function), which will be called with the gradient for that tensor when it’s computed.
  4. If this is your goal, I would do that as a postprocessing step after the backward:
# Whenever you want to reset the history
for p in model.parameters():
  p.history = []

# Inside your training loop
loss.backward() # compute the .grad for all weights
with torch.no_grad(): # deactivate the autograd as you don't want to differentiate through these ops
  for p in model.parameters(): # you can do some filtering here (only conv weights) if you want
    p.history.append(p.grad.detach().clone()) # record the current gradient
    new_grad = your_fn(p.history) # p.history[-1] is the last gradient
    p.grad.copy_(new_grad) # use copy_ so that .grad stays the same Tensor
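To illustrate point 2 above, here is a small sketch (layer sizes are made up) showing the None entry for the first layer's input gradient with the old-style hook:

```python
import torch
import torch.nn as nn

grads = {}

def make_hook(name):
    def hook(module, grad_in, grad_out):
        # record the shape of each entry; None means "gradient not computed"
        grads[name] = [None if g is None else g.shape for g in grad_in]
    return hook

l1 = nn.Conv2d(1, 4, 3)   # layer sizes are made up
l2 = nn.Conv2d(4, 4, 3)
l1.register_backward_hook(make_hook('l1'))
l2.register_backward_hook(make_hook('l2'))

l2(l1(torch.rand(2, 1, 28, 28))).sum().backward()
print(grads['l1'][0])  # None: no one needs the gradient w.r.t. the raw input
print(grads['l2'][0])  # a shape: l1 needs this gradient to continue the backward pass
```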

Emm, I’m developing a strategy for moderating the gradients that needs the mean and variance of a convolution layer’s gradients over a period of time. The series of gradients of conv2d’s weights and biases during training is what I want to record. But I found this too memory-costly and replaced it in the end with an exponentially weighted moving average, just like what is used in the batchnorm layer.
Though changing the gradient in a postprocessing step of the training loop is very clear, it’s hard to locate the layer whose gradient I want to change.
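For reference, a sketch of the moving-average variant, using named_parameters to locate the conv weights (the toy model, the beta value, and the 'weight' name filter are all made up for illustration):

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Conv2d(1, 4, 3), nn.ReLU(), nn.Conv2d(4, 4, 3))
beta = 0.9  # EWMA decay factor, made up

# One running average per conv weight, keyed by parameter name,
# so the layer can be located by name instead of by position
ewma = {name: torch.zeros_like(p)
        for name, p in model.named_parameters()
        if 'weight' in name}

# One training step: after the backward, fold the new grads into the average
model(torch.rand(2, 1, 28, 28)).sum().backward()
with torch.no_grad():
    for name, p in model.named_parameters():
        if name not in ewma:
            continue  # biases and untracked layers are left alone
        ewma[name].mul_(beta).add_(p.grad, alpha=1 - beta)
        p.grad.copy_(ewma[name])  # the optimizer then sees the smoothed gradient
```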

However, your advice offers me a new way of affecting the backward pass and helps me understand the details better. Thanks!