Exact meaning of grad_input and grad_output

I am trying to set up some simulated data and a simple neural net for better understanding of the fundamentals:

import torch
import torch.optim as optim
from torch.autograd import Variable
import torch.nn as nn
import torch.nn.functional as F
import numpy as np

nrows = 9000
ntrain = int(nrows * .7)
X = torch.rand(nrows, 3)
Y = torch.mm(X, torch.from_numpy(
    np.array([[.1], [2], [3]]).astype(np.float32)))
# concat two tensors, like hstack in numpy
# Y = torch.cat([Y < torch.mean(Y), Y >= torch.mean(Y)], dim=1).type(torch.LongTensor)
Y = (Y >= torch.mean(Y)).type(torch.LongTensor).view(nrows)
Xtr = X[:ntrain, :]
Ytr = Y[:ntrain]
Xte = X[ntrain:, :]
Yte = Y[ntrain:]

grad_dict: dict = {}

def fc_hook(layer_name, grad_input, grad_output):
    # Create the entry on first use, then record this call as well
    # (the first backward pass must not be dropped).
    entry = grad_dict.setdefault(
        layer_name, {"grad_input": [], "grad_output": []})
    entry["grad_input"].append(grad_input)
    entry["grad_output"].append(grad_output)


class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.hooked = False
        self.fc1 = nn.Linear(3, 20)
        self.relu1 = nn.ReLU()
        self.fc2 = nn.Linear(20, 30)
        self.relu2 = nn.ReLU()
        self.fc3 = nn.Linear(30, 2)
        self.fc1_hook_handle = self.fc1.register_backward_hook(self.fc1_backward_hook)
        self.fc2_hook_handle = self.fc2.register_backward_hook(self.fc2_backward_hook)
        self.fc3_hook_handle = self.fc3.register_backward_hook(self.fc3_backward_hook)
    def forward(self, x):
        x = self.fc1(x)
        x = F.relu(x)
        x = self.fc2(x)
        x = F.relu(x)
        x = self.fc3(x)
        return x
    def fc1_backward_hook(self, module, grad_input, grad_output):  # module is Linear in this case. Ignored.
        fc_hook("fc1", grad_input, grad_output)
    def fc2_backward_hook(self, module, grad_input, grad_output):
        fc_hook("fc2", grad_input, grad_output)
    def fc3_backward_hook(self, module, grad_input, grad_output):
        fc_hook("fc3", grad_input, grad_output)



net = Net().cuda()
print(net)
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(net.parameters(), lr=0.01, momentum=.8)
NUM_EPOCH = 2
NUM_PER_BATCH = 4

# # one pass backprop
# index_pool = np.arange(Xtr.size(0))
# indices = np.random.choice(index_pool, size=NUM_PER_BATCH, replace=False)
# inputs = Xtr[indices, :].cuda()
# labels = Ytr[torch.from_numpy(indices)].cuda()
# inputs, labels = Variable(inputs), Variable(labels)
# outputs = net(inputs)
# optimizer.zero_grad()
# loss = criterion(outputs, labels)
# loss.backward()
# optimizer.step()
# running_loss += loss.data.item()

NUM_EPOCH = 2
NUM_PER_BATCH = 4
index_pool = np.arange(Xtr.size(0))
for epoch in range(NUM_EPOCH):  # loop over the dataset multiple times
    running_loss = 0.0
    for i in index_pool:
        indices = np.random.choice(
            index_pool, size=NUM_PER_BATCH, replace=False)
        inputs = Xtr[indices, :].cuda()
        labels = Ytr[torch.from_numpy(indices)].cuda()
        inputs, labels = Variable(inputs), Variable(labels)
        outputs = net(inputs)
        optimizer.zero_grad()
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        running_loss += loss.data.item()
        if i % 2000 == 1999:    # print every 2000 mini-batches
            print('[%d, %5d] loss: %.3f' %
                  (epoch + 1, i + 1, running_loss / 2000))
            running_loss = 0.0

accuracy = torch.mean(
    torch.eq(
        torch.max(
            net(Variable(Xte.cuda())),
            dim=1
        )[1].cpu(),
        Yte
    ).type(torch.FloatTensor)
)
print("Accuracy of prediction on test dataset: %f" % accuracy.item())

print(
    grad_dict["fc2"]["grad_input"][0][0]
)

print(
    grad_dict["fc2"]["grad_output"][0][0]
)

# grad_output is a 1-tuple, so compare element 0 of the second recorded call
print(torch.equal(grad_dict["fc2"]["grad_input"][1][0],
                  grad_dict["fc2"]["grad_output"][1][0]))
print(grad_dict["fc2"]["grad_input"][0][0].size())
print(grad_dict["fc2"]["grad_input"][0][1].size())
print(grad_dict["fc2"]["grad_input"][0][2].size())
print(grad_dict["fc2"]["grad_output"][0][0].size())

Each layer in the neural net has a backward hook, but I don’t understand what grad_input and grad_output actually mean. Could anyone explain? Thanks.

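Well, your loss is backpropagated through your network starting from the end.
grad_output is the gradient arriving at the layer from the layers after it (the gradient w.r.t. the layer's output), and grad_input is the gradient the layer passes on towards the front of the network (the gradient w.r.t. its input).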

Ok, let’s get a bit more specific:

>>> print(grad_dict["fc2"]["grad_input"][0][0].size())
torch.Size([4, 30])
>>> print(grad_dict["fc2"]["grad_input"][0][1].size())
torch.Size([4, 20])
>>> print(grad_dict["fc2"]["grad_input"][0][2].size())
torch.Size([20, 30])
>>> print(grad_dict["fc2"]["grad_output"][0][0].size())
torch.Size([4, 30])

grad_input is a 3-tuple, my guess:

  • [0] is the derivative of loss wrt layer input
  • [1] is the derivative of loss wrt layer output (before activation)
  • [2] is the derivative of loss wrt layer weights

grad_output is a 1-tuple, perhaps it’s the derivative of loss wrt layer output after activation?

Please correct me if I am wrong.

It actually is a bit more complicated:

  • grad_output is the gradient of the loss w.r.t. the layer output. So if you have a layer l and do, say, y = l(x) ; loss = y.sum(); loss.backward(), you get the gradient of loss w.r.t. y.
  • grad_input is the gradient w.r.t. the inputs of the last operation in the layer. This may not quite be what you expected… For linear layers, this is fairly complete, as the last op is torch.addmm, which multiplies the input with the weight and adds the bias. For other layers it is not: for a Sequential, for example, it will be the last op of the last layer, whose inputs are not even remotely related to the Sequential’s inputs. You can see which op will be used by looking at y.grad_fn.

So to be honest, I don’t know what the exact use case for that would be and I certainly cannot comment on the exact design choice for that, but you can see how a module hook is turned into a hook on grad_fn in the source of torch/nn/modules/module.py.

Best regards

Thomas
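
The description of grad_output above can be checked with a small sketch. It assumes a PyTorch version that provides register_full_backward_hook (the newer hook API, not the register_backward_hook used in the question); the layer sizes, the captured dict and save_grads are illustrative names, not from the thread. With loss = y.sum(), the gradient of the loss w.r.t. y should be all ones, and both hook arguments arrive as tuples of tensors:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

captured = {}  # hypothetical container for what the hook sees


def save_grads(module, grad_input, grad_output):
    # Both arguments are tuples of tensors (entries may be None).
    captured["grad_input"] = grad_input
    captured["grad_output"] = grad_output


layer = nn.Linear(20, 30)
handle = layer.register_full_backward_hook(save_grads)

x = torch.rand(4, 20, requires_grad=True)
y = layer(x)
loss = y.sum()
loss.backward()
handle.remove()

g_out = captured["grad_output"][0]  # d(loss)/d(y), one row per sample
g_in = captured["grad_input"][0]    # d(loss)/d(x), one row per sample

print(len(captured["grad_output"]), tuple(g_out.shape))
print(len(captured["grad_input"]), tuple(g_in.shape))
print(torch.equal(g_out, torch.ones(4, 30)))  # gradient of a sum is all ones
print(torch.allclose(g_in, x.grad))           # same gradient autograd stores on x
```

In this sketch the tuples have one entry per module output or input, and the batch is the first dimension of each tensor. With the older register_backward_hook, grad_input may instead reflect the last operation in the layer, as described above.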

This should really be part of the documentation. It’s really unintelligible to me.

I’m sorry for digging up this topic, but I didn’t want to create a new one for a simple question.
Why is grad_output a tuple whose length is the batch size, while grad_input contains the batch as its first dimension? What’s behind this design choice?

Thanks!

Can someone please explain clearly what grad_input and grad_output are in PyTorch?

Further to this: would it not make sense to also have the forward inputs and outputs available in the backward hook? They should already be in memory.

  • To debug e.g. errors or spikes in gradients, one often wants access to all the variables the backward operation used (forward inputs/outputs, backward inputs/outputs and parameter gradients).
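
One possible workaround, sketched here under assumptions, is to pair a forward hook with a backward hook on the same module and store both in one record. register_forward_hook and register_full_backward_hook are existing nn.Module methods; the records dict, make_hooks and the "fc2" key are hypothetical names chosen for this illustration:

```python
import torch
import torch.nn as nn

records = {}  # hypothetical store, keyed by a layer name we choose


def make_hooks(name):
    def forward_hook(module, inputs, output):
        # Keep detached copies of the forward inputs and output.
        records[name] = {
            "fwd_in": tuple(t.detach() for t in inputs),
            "fwd_out": output.detach(),
        }

    def backward_hook(module, grad_input, grad_output):
        # Store the gradients next to the forward values saved above.
        records[name]["grad_input"] = grad_input
        records[name]["grad_output"] = grad_output

    return forward_hook, backward_hook


net = nn.Sequential(nn.Linear(3, 20), nn.ReLU(), nn.Linear(20, 2))
fwd, bwd = make_hooks("fc2")
handles = [
    net[2].register_forward_hook(fwd),
    net[2].register_full_backward_hook(bwd),
]

x = torch.rand(4, 3)
labels = torch.tensor([0, 1, 1, 0])
loss = nn.CrossEntropyLoss()(net(x), labels)
loss.backward()
for h in handles:
    h.remove()

rec = records["fc2"]
print(tuple(rec["fwd_in"][0].shape), tuple(rec["fwd_out"].shape))
print(tuple(rec["grad_input"][0].shape), tuple(rec["grad_output"][0].shape))
```

Parameter gradients are not passed to the hook in this sketch; after backward() they can be read from net[2].weight.grad and net[2].bias.grad.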

I had a similar question about this recently and think I finally understand the breakdown of grad_output and grad_input. Here’s what I understand.

Hook parameters are (module, grad_input, grad_output).

Module - current module under inspection. In a simple case I was playing with I had two convolution layers and three fully connected layers. So module in my case was “fc3”, “fc2”, “fc1”, “conv2”, or “conv1”.

grad_input
PyTorch tracks the operations within a layer. grad_input contains the gradients of the loss with respect to everything that goes into the layer's forward computation, so its entries have the same shapes as those quantities. The entries appear in a different order depending on the layer type. I looked at fully connected layers and convolution layers.

Convolution Layers
For one of my convolution layers, it received feature maps that were 16x16 (height/width) and 16 layers deep. The batch size was 64. For this layer it also had 32 kernels that were 3x3. In the following I have the input/output index, the shape of the data, and then my understanding of what it is. The grad_input for this looked like the following:

grad_input[0] - [64, 16, 16, 16] - Gradient w.r.t. the input data. 64 samples in the batch, 16 feature maps deep, 16 height, 16 width.
grad_input[1] - [32, 16, 3, 3] - Gradient w.r.t. the kernel weights. 32 kernels with 16 depth (to match the number of input feature maps), and 3x3 height/width.
grad_input[2] - [32] - Gradient w.r.t. the bias of each kernel.

These shapes match everything that goes into the forward pass: the entire batch of data, the kernel weights, and the bias values.
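
The shapes listed above line up with the input, weights and bias because a gradient has the same shape as the tensor it is taken with respect to. A minimal sketch that checks this with torch.autograd.grad rather than a hook; the layer below is an assumed stand-in for the one described (padding=1 is a guess that keeps the 16x16 maps, the thread does not state it):

```python
import torch
import torch.nn as nn

# Assumed layer: 16 input maps, 32 kernels of 3x3; padding=1 keeps 16x16 maps.
conv = nn.Conv2d(16, 32, kernel_size=3, padding=1)
x = torch.rand(64, 16, 16, 16, requires_grad=True)

y = conv(x)        # shape [64, 32, 16, 16]
loss = y.sum()

# Gradients of the loss w.r.t. the input, the kernel weights and the bias.
grad_x, grad_w, grad_b = torch.autograd.grad(loss, (x, conv.weight, conv.bias))

for name, grad, value in [
    ("input", grad_x, x),
    ("weight", grad_w, conv.weight),
    ("bias", grad_b, conv.bias),
]:
    # Each gradient has the shape of the tensor it belongs to.
    print(name, tuple(grad.shape), tuple(value.shape))
```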

Fully Connected Layers
This is a similar idea to the convolution layers, but appears to be in a different order. This fully connected layer had 84 inputs and 10 outputs. (i.e. the previous fully connected layer had 84 nodes and this one has 10 outputs). Batch size of 64.

grad_input[0] - [10] - Gradient w.r.t. the bias values.
grad_input[1] - [64, 84] - Gradient w.r.t. the data. The first value is the 64 samples in the batch, then the 84 inputs from the previous layer.
grad_input[2] - [84, 10] - Gradient w.r.t. the layer weights, in transposed layout (nn.Linear stores its weight as [10, 84]). Each of the 10 nodes in the fully connected layer receives the 84 outputs from the previous layer.

grad_input has one gradient for each quantity used in the forward pass: the batch of inputs, the node weights, and the node biases. I suspect this information is useful if one is interested in looking at how the gradients for the weights or inputs of the current layer change over time.
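
For a fully connected layer computing y = x @ W.T + b, the three gradients can be written in closed form from the incoming gradient g (the role played by grad_output[0]). The sketch below compares those formulas with autograd; g is a random stand-in and all variable names are illustrative assumptions, not taken from the post:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

fc = nn.Linear(84, 10)
x = torch.rand(64, 84, requires_grad=True)

y = fc(x)               # y = x @ W.T + b, shape [64, 10]
g = torch.rand(64, 10)  # stand-in for grad_output[0]

# Let autograd push g back to the input, the weight and the bias.
grad_x, grad_w, grad_b = torch.autograd.grad(
    y, (x, fc.weight, fc.bias), grad_outputs=g)

# Closed-form gradients of y = x @ W.T + b
manual_x = g @ fc.weight.detach()  # [64, 84]
manual_w = g.t() @ x.detach()      # [10, 84]; its transpose is [84, 10]
manual_b = g.sum(dim=0)            # [10], summed over the batch

print(torch.allclose(grad_x, manual_x, rtol=1e-4, atol=1e-5))
print(torch.allclose(grad_w, manual_w, rtol=1e-4, atol=1e-5))
print(torch.allclose(grad_b, manual_b, rtol=1e-4, atol=1e-5))
print(tuple(grad_x.shape), tuple(grad_w.t().shape), tuple(grad_b.shape))
```

The [84, 10] entry seen in the hook corresponds to the transposed weight gradient here, which is consistent with the layer multiplying the input by the transposed weight matrix.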

grad_output

The grad output contains (as Thomas V mentioned) the gradients of the loss with respect to the layer output. For my examples above, this is what I see:

Convolution Layer

From my previous example, my convolution layer has 32 kernels. The feature maps/data being fed into the layer is 16 deep and 16x16 height/width.

grad_output[0] - [64, 32, 16, 16] - Batch size of 64 (64 sets of gradients). 32 sets of 16x16 gradients.

Fully Connected Layer

From the previous example, my fully connected layer had 84 inputs and 10 outputs. Batch size is 64.

grad_output[0] - [64, 10] - 64 instances within the batch, 10 gradients at the output. We’re given the gradient corresponding to each sample within our batch.

Hopefully this adds some clarity… If I’m off on any of this please feel free to add clarification.

Patrick