I am trying to set up some simulated data and a simple neural net for better understanding of the fundamentals:

``````
import torch
import torch.optim as optim
import torch.nn as nn
import torch.nn.functional as F
import numpy as np

# Fall back to the CPU when no GPU is present (the original post assumed CUDA).
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

nrows = 9000
ntrain = int(nrows * .7)
X = torch.rand(nrows, 3)
Y = torch.mm(X, torch.from_numpy(
    np.array([[.1], [2], [3]]).astype(np.float32)))
# concat two tensors, like hstack in numpy
# Y = torch.cat([Y < torch.mean(Y), Y >= torch.mean(Y)], dim=1).type(torch.LongTensor)
# Binary label: 1 where the linear score is at or above its mean, else 0.
Y = (Y >= torch.mean(Y)).type(torch.LongTensor).view(nrows)
Xtr = X[:ntrain, :]
Ytr = Y[:ntrain]
Xte = X[ntrain:, :]
Yte = Y[ntrain:]

# The hook bodies were cut off in the post. Appending each call's gradients to
# this dict is an assumption, inferred from the grad_dict lookup further down.
grad_dict = {name: {"grad_input": [], "grad_output": []}
             for name in ("fc1", "fc2", "fc3")}


class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.hooked = False
        self.fc1 = nn.Linear(3, 20)
        self.relu1 = nn.ReLU()
        self.fc2 = nn.Linear(20, 30)
        self.relu2 = nn.ReLU()
        self.fc3 = nn.Linear(30, 2)
        self.fc1_hook_handle = self.fc1.register_backward_hook(self.fc1_backward_hook)
        self.fc2_hook_handle = self.fc2.register_backward_hook(self.fc2_backward_hook)
        self.fc3_hook_handle = self.fc3.register_backward_hook(self.fc3_backward_hook)

    def forward(self, x):
        x = self.fc1(x)
        x = F.relu(x)
        x = self.fc2(x)
        x = F.relu(x)
        x = self.fc3(x)
        return x

    def _record(self, name, grad_input, grad_output):
        # Assumed hook body: keep the tuples so they can be inspected later.
        grad_dict[name]["grad_input"].append(grad_input)
        grad_dict[name]["grad_output"].append(grad_output)

    def fc1_backward_hook(self, module, grad_input, grad_output):  # module is Linear in this case. Ignored.
        self._record("fc1", grad_input, grad_output)

    def fc2_backward_hook(self, module, grad_input, grad_output):
        self._record("fc2", grad_input, grad_output)

    def fc3_backward_hook(self, module, grad_input, grad_output):
        self._record("fc3", grad_input, grad_output)


net = Net().to(device)
print(net)
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(net.parameters(), lr=0.01, momentum=.8)
NUM_EPOCH = 2
NUM_PER_BATCH = 4

# # one pass backprop
# index_pool = np.arange(Xtr.size(0))
# indices = np.random.choice(index_pool, size=NUM_PER_BATCH, replace=False)
# inputs = Xtr[indices, :].to(device)
# labels = Ytr[torch.from_numpy(indices)].to(device)
# outputs = net(inputs)
# loss = criterion(outputs, labels)
# loss.backward()
# optimizer.step()

index_pool = np.arange(Xtr.size(0))
for epoch in range(NUM_EPOCH):  # loop over the dataset multiple times
    running_loss = 0.0
    for i in index_pool:
        # Draw a random mini-batch of NUM_PER_BATCH training rows.
        indices = np.random.choice(
            index_pool, size=NUM_PER_BATCH, replace=False)
        inputs = Xtr[indices, :].to(device)
        labels = Ytr[torch.from_numpy(indices)].to(device)
        optimizer.zero_grad()  # clear gradients left over from the previous step
        outputs = net(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        running_loss += loss.item()
        if i % 2000 == 1999:    # print every 2000 mini-batches
            print('[%d, %5d] loss: %.3f' %
                  (epoch + 1, i + 1, running_loss / 2000))
            running_loss = 0.0

with torch.no_grad():  # evaluation only, no gradients needed
    accuracy = torch.mean(
        torch.eq(
            torch.max(
                net(Xte.to(device)),
                dim=1
            )[1].cpu(),
            Yte
        ).type(torch.FloatTensor)
    )
print("Accuracy of prediction on test dataset: %f" % accuracy.item())
``````

Each layer in the neural net has a backward hook, but I don't understand what `grad_input` and `grad_output` actually mean. Could anyone explain? Thanks.


Well, your loss is backpropagated through your network starting from the end.
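To make "starting from the end" concrete, here is a minimal sketch that records the order in which the backward hooks fire. It assumes PyTorch 1.8 or newer (for `register_full_backward_hook`); the `layers` dict, `make_hook`, and `call_order` are names made up for this illustration, with the same layer sizes as the `Net` in the question.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Same layer sizes as the Net in the question; the keys are just labels.
layers = nn.ModuleDict({
    "fc1": nn.Linear(3, 20),
    "fc2": nn.Linear(20, 30),
    "fc3": nn.Linear(30, 2),
})

call_order = []  # hypothetical log of which layer's hook fired, in order


def make_hook(name):
    def hook(module, grad_input, grad_output):
        call_order.append(name)  # this layer's backward step has just run
    return hook


for name, layer in layers.items():
    layer.register_full_backward_hook(make_hook(name))

x = torch.rand(4, 3, requires_grad=True)
h = torch.relu(layers["fc1"](x))
h = torch.relu(layers["fc2"](h))
out = layers["fc3"](h)
loss = nn.functional.cross_entropy(out, torch.tensor([0, 1, 0, 1]))
loss.backward()

print(call_order)
```

The hooks fire in the reverse of the forward order: the last layer first, the first layer last.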


OK, let's get a bit more specific:

``````
>>> print(grad_dict["fc2"]["grad_input"][0][0].size())
torch.Size([4, 30])
torch.Size([4, 20])
torch.Size([20, 30])
torch.Size([4, 30])
``````

`grad_input` is a 3-tuple, my guess:

• [0] is the derivative of loss wrt layer input
• [1] is the derivative of loss wrt layer output (before activation)
• [2] is the derivative of loss wrt layer weights

`grad_output` is a 1-tuple, perhaps it's the derivative of loss wrt layer output after activation?

Please correct me if I am wrong.

It actually is a bit more complicated:

• `grad_output` is the gradient of the loss w.r.t. the layer output. So if you have a layer `l` and do, say, `y = l(x) ; loss = y.sum(); loss.backward()`, you get the gradient of `loss` w.r.t. `y`.
• `grad_input` is the gradient of the loss w.r.t. the inputs of the last operation in the layer. This may not quite be what you expected. For linear layers, this is fairly complete, as the last op is `torch.addmm`, which multiplies the input with the weight and adds the bias. For other modules (e.g. a Sequential), it will be the gradients for the inputs of the last op of the last layer, which are not even remotely related to the Sequential's own inputs. You can see which op will be used by looking at `y.grad_fn`.

So to be honest, I don't know what the exact use case for that would be and I certainly cannot comment on the exact design choice for that, but you can see how a module hook is turned into a hook on `grad_fn` in the source of torch/nn/modules/module.py.
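The `y = l(x); loss = y.sum(); loss.backward()` example can be tried directly. This is only a sketch and it swaps in `register_full_backward_hook` (available from PyTorch 1.8, as far as I know), where `grad_input` is the gradient w.r.t. the module's own inputs rather than the inputs of its last op. The `captured` dict is a hypothetical helper for this illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

l = nn.Linear(3, 2)
x = torch.rand(4, 3, requires_grad=True)

# Without any hook attached, grad_fn names the last op that produced the output.
# The exact class name differs between PyTorch versions, so it is only printed.
print(type(l(x).grad_fn).__name__)

captured = {}  # hypothetical store for what the hook receives


def hook(module, grad_input, grad_output):
    captured["grad_input"] = grad_input
    captured["grad_output"] = grad_output


l.register_full_backward_hook(hook)

y = l(x)
loss = y.sum()
loss.backward()

grad_out = captured["grad_output"][0]  # d loss / d y
grad_in = captured["grad_input"][0]    # d loss / d x

print(grad_out.shape, grad_in.shape)
```

Because `loss = y.sum()`, the gradient w.r.t. `y` is a tensor of ones with the shape of `y`, and the gradient w.r.t. `x` is those ones multiplied by the weight matrix, which is also what ends up in `x.grad`.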

Best regards

Thomas


This should really be part of the documentation. It's really unintelligible to me.


I'm sorry for digging out this topic, but I didn't want to create a new one for a simple question.
Why is `grad_output` a tuple whose length is the batch size, while `grad_input` contains the batch as its first dimension? What's behind this design choice?

Thanks!
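One way to check the premise is to run the same layer with two different batch sizes and log what the hook sees. A minimal sketch, assuming PyTorch 1.8 or newer for `register_full_backward_hook`; the `seen` list is made up for this illustration.

```python
import torch
import torch.nn as nn

layer = nn.Linear(20, 30)
seen = []  # (length of the grad_output tuple, shape of its first entry)


def hook(module, grad_input, grad_output):
    seen.append((len(grad_output), tuple(grad_output[0].shape)))


layer.register_full_backward_hook(hook)

for batch_size in (4, 7):
    x = torch.rand(batch_size, 20, requires_grad=True)
    layer(x).sum().backward()

print(seen)
```

In this setup the tuple has one entry per output tensor of the module (a single one for `nn.Linear`), and the batch shows up as the first dimension of that tensor, just as it does for `grad_input`.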


Further to this - would it not make sense to also have the forward IO parameters in the backward hook? They should already be in memory.

• To debug e.g. errors or spikes in gradients, one often wants access to all the variables that the bwd operation used (fwd IO, bwd IO and Parameter grads).
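Until something like that exists, a possible workaround is to cache the forward IO in a forward hook and read it back inside the backward hook. This is only a sketch, assuming PyTorch 1.8 or newer; `cache`, `records`, and the two hook functions are hypothetical names, not part of any PyTorch API.

```python
import torch
import torch.nn as nn

layer = nn.Linear(3, 2)
cache = {}    # hypothetical store: module -> (forward input, forward output)
records = []  # one entry per backward call, with forward and backward shapes


def forward_hook(module, inputs, output):
    # Keep detached copies so the cache does not extend the autograd graph.
    cache[module] = (inputs[0].detach(), output.detach())


def backward_hook(module, grad_input, grad_output):
    fwd_in, fwd_out = cache[module]
    records.append({
        "fwd_in": tuple(fwd_in.shape),
        "fwd_out": tuple(fwd_out.shape),
        "grad_in": tuple(grad_input[0].shape),
        "grad_out": tuple(grad_output[0].shape),
    })


layer.register_forward_hook(forward_hook)
layer.register_full_backward_hook(backward_hook)

x = torch.rand(4, 3, requires_grad=True)
layer(x).sum().backward()

print(records[0])
```

The cost is that the cached tensors stay alive until they are overwritten or the cache is cleared, so this is better suited to debugging than to regular training.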

Module - the current module under inspection. In a simple case I was playing with, I had two convolution layers and three fully connected layers, so the module in my case was "fc3", "fc2", "fc1", "conv2", or "conv1".

PyTorch tracks the operations within a layer. grad_input holds one gradient for each tensor that feeds the layer's last operation, so its entries have the same shapes as everything needed for the forward pass. The entries appear in a different order depending on the layer type. I looked at fully connected layers and convolution layers.

Convolution Layers
For one of my convolution layers, it received feature maps that were 16x16 (height/width) and 16 layers deep. The batch size was 64. For this layer it also had 32 kernels that were 3x3. In the following I have the input/output index, the shape of the data, and then my understanding of what it is. The grad_input for this looked like the following:

grad_input[0] - [64, 16, 16, 16] - Gradient w.r.t. the input data: 64 samples in the batch, 16 feature maps deep, 16 height, 16 width.
grad_input[1] - [32, 16, 3, 3] - Gradient w.r.t. the kernel weights: 32 kernels with depth 16 (to match the number of input feature maps) and 3x3 height/width.
grad_input[2] - [32] - Gradient w.r.t. the bias of each kernel.

These shapes match everything needed to do the forward pass: the entire batch of data, the kernel weights, and the bias values.

Fully Connected Layers
This is a similar idea to the convolution layers, but appears to be in a different order. This fully connected layer had 84 inputs and 10 outputs. (i.e. the previous fully connected layer had 84 nodes and this one has 10 outputs). Batch size of 64.

grad_input[0] - [10] - Gradient w.r.t. the bias values.
grad_input[1] - [64, 84] - Gradient w.r.t. the data: 64 samples in the batch, 84 inputs from the previous layer.
grad_input[2] - [84, 10] - Gradient w.r.t. the layer weights. Each node in the fully connected layer receives the 84 outputs from the previous layer, and there are 10 nodes.

grad_input therefore mirrors everything needed to calculate the forward pass (all batch data inputs, node weights, and node biases), but it holds their gradients rather than their values. I suspect this information is useful if one is interested in looking at how the gradients for the weights or inputs of the current layer change over time.

grad_output contains (as Thomas V mentioned) the gradients of the loss with respect to the layer output. For my examples above, this is what I see:

Convolution Layer

From my previous example, my convolution layer has 32 kernels. The feature maps fed into the layer are 16 deep and 16x16 in height/width.

grad_output[0] - [64, 32, 16, 16] - Batch size of 64 (64 sets of gradients). 32 sets of 16x16 gradients.

Fully Connected Layer

From the previous example, my fully connected layer had 84 inputs and 10 outputs. Batch size is 64.

grad_output[0] - [64, 10] - 64 instances within the batch, 10 gradients at the output. We're given the gradient corresponding to each sample within our batch.
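The shapes above can be reproduced with a short script. This is a sketch under a few assumptions: it uses `register_full_backward_hook` (PyTorch 1.8 or newer), where grad_input only covers the module's inputs, so the parameter gradients are read from `.grad` instead; `padding=1` is my guess to keep the 16x16 maps at 16x16; and `backward_shapes` is a helper made up for this illustration.

```python
import torch
import torch.nn as nn


def backward_shapes(layer, x):
    """Run one backward pass and return the shapes a full backward hook sees."""
    shapes = {}

    def hook(module, grad_input, grad_output):
        shapes["grad_input"] = [tuple(g.shape) for g in grad_input]
        shapes["grad_output"] = [tuple(g.shape) for g in grad_output]

    handle = layer.register_full_backward_hook(hook)
    layer(x).sum().backward()
    handle.remove()  # detach the hook again so the layer is left unchanged

    # Parameter gradients are stored on the parameters after backward().
    shapes["weight.grad"] = tuple(layer.weight.grad.shape)
    shapes["bias.grad"] = tuple(layer.bias.grad.shape)
    return shapes


# Convolution layer: 16 input maps of 16x16, 32 kernels of 3x3, batch of 64.
conv = nn.Conv2d(16, 32, kernel_size=3, padding=1)
conv_shapes = backward_shapes(conv, torch.rand(64, 16, 16, 16, requires_grad=True))

# Fully connected layer: 84 inputs, 10 outputs, batch of 64.
fc = nn.Linear(84, 10)
fc_shapes = backward_shapes(fc, torch.rand(64, 84, requires_grad=True))

print(conv_shapes)
print(fc_shapes)
```

Note that `nn.Linear` stores its weight as [10, 84]; the [84, 10] entry listed earlier is presumably the gradient for the transposed weight that the underlying matrix multiply receives.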

Hopefully this adds some clarity. If I'm off on any of this, please feel free to add clarification.

Patrick


Can anyone confirm this?

When I use PyTorch 1.2, `grad_in` is a tuple with three elements, but in 1.7 it has only two. Do you know what changed? Thanks!
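I can't say what changed between 1.2 and 1.7, but one way to look into it is to print what the hook receives on the version you have installed. A rough sketch, assuming PyTorch 1.8 or newer so that `register_full_backward_hook` is available for comparison; the helper `grad_input_shapes` is made up for this illustration, and the legacy result is printed rather than asserted because it depends on the version.

```python
import warnings

import torch
import torch.nn as nn


def grad_input_shapes(register_name):
    """Return the shapes in grad_input for a fresh Linear layer and one backward pass."""
    layer = nn.Linear(84, 10)
    shapes = []

    def hook(module, grad_input, grad_output):
        # Entries can be None when no gradient is needed for that input.
        shapes.extend(None if g is None else tuple(g.shape) for g in grad_input)

    x = torch.rand(64, 84, requires_grad=True)
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")  # the legacy hook may warn that it is deprecated
        getattr(layer, register_name)(hook)
        layer(x).sum().backward()
    return shapes


print(torch.__version__)
legacy = grad_input_shapes("register_backward_hook")     # version-dependent contents
full = grad_input_shapes("register_full_backward_hook")  # gradients w.r.t. module inputs
print("legacy:", legacy)
print("full:  ", full)
```

With the full hook, the tuple should contain only the gradient for the layer's input, whatever ops the layer uses internally, which makes it the more stable choice across versions.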
