Exact meaning of grad_input and grad_output

I had a similar question about this recently and think I finally understand the breakdown of grad_input and grad_output. Here's my understanding.

Hook parameters are (module, grad_input, grad_output).

module - the current module under inspection. In a simple case I was playing with, I had two convolution layers and three fully connected layers, so module in my case was "fc3", "fc2", "fc1", "conv2", or "conv1" (the hooks fire in backward order, so fc3 comes first).
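To make this concrete, here's a minimal sketch of the kind of setup I was playing with (the exact channel counts and image size are my assumptions; any small network works). Note that the tuple layouts described in this post are what the legacy module.register_backward_hook reports; newer PyTorch releases recommend register_full_backward_hook, whose grad_input contains only gradients with respect to the module's inputs, not its parameters.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# A small LeNet-style network, assumed here just to make the hook behavior concrete.
class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 16, 3, padding=1)
        self.conv2 = nn.Conv2d(16, 32, 3, padding=1)
        self.fc1 = nn.Linear(32 * 8 * 8, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

    def forward(self, x):
        x = F.max_pool2d(F.relu(self.conv1(x)), 2)   # 32x32 -> 16x16
        x = F.max_pool2d(F.relu(self.conv2(x)), 2)   # 16x16 -> 8x8
        x = x.flatten(1)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        return self.fc3(x)

def hook(module, grad_input, grad_output):
    # module is the layer the hook was registered on; grad_input entries can be
    # None (e.g. the gradient w.r.t. the raw input image is never computed).
    print(type(module).__name__,
          [None if g is None else tuple(g.shape) for g in grad_input],
          [tuple(g.shape) for g in grad_output])

net = Net()
for name, m in net.named_modules():
    if isinstance(m, (nn.Conv2d, nn.Linear)):
        m.register_backward_hook(hook)  # legacy hook; the tuple layouts below come from this API

out = net(torch.randn(64, 3, 32, 32))
out.sum().backward()  # hooks fire during the backward pass, deepest layer (fc3) first
```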

grad_input
PyTorch tracks the operations within a layer. grad_input holds the gradients of the loss with respect to everything that entered the layer's forward pass: the input data, the weights, and the bias. Each entry has the same shape as the tensor it corresponds to, which is why the tuple looks like a mirror of the forward-pass inputs. The entries appear in a different order depending on layer type. I looked at fully connected layers and convolution layers.

Convolution Layers
One of my convolution layers received feature maps that were 16x16 (height/width) and 16 channels deep, with a batch size of 64. The layer had 32 kernels, each 3x3. In the following I list the tuple index, the shape of the tensor, and my understanding of what it is. The grad_input for this layer looked like the following:

grad_input[0] - [64, 16, 16, 16] - Gradient with respect to the input data: 64 batch entries, 16 feature maps deep, 16x16 height/width (same shape as the input itself).
grad_input[1] - [32, 16, 3, 3] - Gradient with respect to the kernel weights: 32 kernels, each 16 deep (to match the number of input feature maps) and 3x3 in height/width.
grad_input[2] - [32] - Gradient with respect to the bias, one value per kernel.

Taken together, the tuple mirrors everything needed for the forward pass, the entire batch of data, the kernel weights, and the bias values, with each tensor replaced by its gradient.
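A quick way to check these shapes yourself; this is just a sketch, assuming the same nn.Conv2d(16, 32, 3, padding=1) layer as above and a dummy loss of output.sum():

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(16, 32, kernel_size=3, padding=1)

def hook(module, grad_input, grad_output):
    # Legacy-hook ordering for Conv2d: (grad wrt input, grad wrt weight, grad wrt bias)
    for i, g in enumerate(grad_input):
        print(f"grad_input[{i}]:", tuple(g.shape))
    # grad_input[0]: (64, 16, 16, 16), grad_input[1]: (32, 16, 3, 3), grad_input[2]: (32,)

conv.register_backward_hook(hook)

x = torch.randn(64, 16, 16, 16, requires_grad=True)  # requires_grad so the grad wrt input is not None
conv(x).sum().backward()
```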

Fully Connected Layers
This is the same idea as the convolution layers, but the entries appear in a different order. This fully connected layer had 84 inputs and 10 outputs (i.e. the previous fully connected layer had 84 nodes and this one has 10). Batch size of 64.

grad_input[0] - [10] - Gradient with respect to the bias values.
grad_input[1] - [64, 84] - Gradient with respect to the input data: 64 batch entries, each with the 84 inputs from the previous layer.
grad_input[2] - [84, 10] - Gradient with respect to the layer weights. Each of the 10 nodes receives the 84 outputs of the previous layer; note this shape is the transpose of the stored weight matrix (module.weight is [10, 84]).

So grad_input covers the gradient of the loss with respect to everything used in the forward pass: all the batch data inputs, the node weights, and the node biases. I suspect this information is useful if one is interested in looking at how the weights of, or the inputs to, the current layer change over time.
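Again, a small sketch to verify the ordering for nn.Linear(84, 10) (my assumption of the layer sizes from the example); you can also see that grad_input[2] comes out as the transpose of module.weight:

```python
import torch
import torch.nn as nn

fc = nn.Linear(84, 10)

def hook(module, grad_input, grad_output):
    # Legacy-hook ordering for Linear: (grad wrt bias, grad wrt input, grad wrt weight.T)
    for i, g in enumerate(grad_input):
        print(f"grad_input[{i}]:", tuple(g.shape))
    # grad_input[0]: (10,), grad_input[1]: (64, 84), grad_input[2]: (84, 10)
    print("fc.weight:", tuple(module.weight.shape))  # (10, 84) -- grad_input[2] is its transpose

fc.register_backward_hook(hook)

x = torch.randn(64, 84, requires_grad=True)
fc(x).sum().backward()
```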

grad_output

grad_output contains (as Thomas V mentioned) the gradients of the loss with respect to the layer output. For my examples above, this is what I see:

Convolution Layer

From my previous example, the convolution layer has 32 kernels, and the feature maps fed into it are 16 deep and 16x16 in height/width, so the layer produces 32 output maps of 16x16 (the spatial size is preserved here).

grad_output[0] - [64, 32, 16, 16] - Batch size of 64 (64 sets of gradients), each containing 32 maps of 16x16 gradients, i.e. one gradient per element of the layer's output.

Fully Connected Layer

From the previous example, my fully connected layer had 84 inputs and 10 outputs. Batch size is 64.

grad_output[0] - [64, 10] - 64 instances within the batch, 10 gradients at the output. We’re given the gradient corresponding to each sample within our batch.
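A combined sketch checking grad_output for both example layers (same assumed layer sizes as above). With a plain .sum() loss every element of grad_output[0] is 1.0, which makes it easy to see there is exactly one gradient per output element:

```python
import torch
import torch.nn as nn

def hook(module, grad_input, grad_output):
    print(type(module).__name__, "grad_output[0]:", tuple(grad_output[0].shape))

conv = nn.Conv2d(16, 32, kernel_size=3, padding=1)
fc = nn.Linear(84, 10)
conv.register_backward_hook(hook)
fc.register_backward_hook(hook)

# d(sum)/d(output element) = 1, so grad_output[0] is all ones here.
conv(torch.randn(64, 16, 16, 16)).sum().backward()  # Conv2d grad_output[0]: (64, 32, 16, 16)
fc(torch.randn(64, 84)).sum().backward()            # Linear grad_output[0]: (64, 10)
```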

Hopefully this adds some clarity… If I’m off on any of this please feel free to add clarification.

Patrick
