'gradient' argument in out.backward(gradient)

xu-song · January 23, 2018, 4:58pm

Q1: what is the purpose of `gradient` argument?

gradient will perform a dot product on out, right?

what is the objective of this operation?

Q2: Why `gradient` argument is required for tensor outputs?

Can we use the following backward, which not specify the gradient argument?

out.backward()

It will cause the following error

“grad can be implicitly created only for scalar outputs”

Why is the gradient argument required in tensor outputs? We can set a tensor of ones as the default value of gradient.

jpeg729 · January 23, 2018, 5:20pm

The shape of the params has absolutely no relation to the required shape of the argument to out.backward.
params is a list containing the weight tensors of the various modules in your net, so in this case the net has two Conv2d layers which have 2 tensors of parameters each (.weight and .bias), and three Linear layers which each have 2 tensors of parameters (.weight and .bias). That makes 10 tensors in total.

The last layer of net has 10 outputs which is why out.backward needs 10 inputs. One per output.

Calling out.backward() will --calculate-the-gradient-of-the-output–. EDIT: out.backward() is equivalent to out.backward(torch.Tensor([1]))
Usually we need the gradient of the loss. e.g.

out = net(input)
loss = torch.nn.functional.mse_loss(out, target)
loss.backward()

Each time you run .backward() the stored gradients for each parameter are updated by adding the new gradients. This allows us to cumulate gradients over several samples or several batches before using the gradients to update the weights. Once you have updated the weights you don’t want to keep those gradients around because if you reuse them, then you will push the weights too far.

xu-song · January 23, 2018, 5:33pm

Yes, The shape of the argument should same with the shape of out

jpeg729 · January 23, 2018, 6:00pm

Concerning out.backward(), I was mistaken, you are right. It is equivalent to doing out.backward(torch.Tensor([1]))

The params are all declared using Variable(.., requires_grad=True) or something equivalent. This means that whenever you use those params in a calculation pyTorch assumes you are going to want to calculate the gradient with respect to those params and it saves all the details it is going to need in order to calculate those gradients.

The value out was calculated using the params listed in net.parameters(), so whenever you call out.backward the gradient will be calculated with respect to every one of those parameters.

richard · January 23, 2018, 7:02pm

For your first question, the output of your network out = net(input) has size (1, 10), so you want the gradient you pass in to match that size.

xu-song · January 24, 2018, 3:59am

what does `gradient` argument do in backward()?

I make a test by the following case.

import torch
from torch.autograd import Variable

x = Variable(torch.ones(2, 3), requires_grad=True)
a = Variable(torch.ones(3, 4))
out = torch.mm(x,a)    # shape: (2,4)

gradient = torch.FloatTensor([[1,2,3,4],[5,6,7,8]]) # shape=(2,4)
#gradient = torch.ones(4, 4)
out.backward(gradient)
print(x.grad)

x

x_11	x_12	x_13
x_21	x_22	x_23

out

x_11+x_12+x_13	x_11+x_12+x_13	x_11+x_12+x_13	x_11+x_12+x_13
x_21+x_22+x_23	x_21+x_22+x_23	x_21+x_22+x_23	x_21+x_22+x_23

gradient*out

x_11+x_12+x_13	2x_11+2x_12+2x_13	3x_11+3x_12+3x_13	4x_11+4x_12+4x_13
5x_21+5x_22+5x_23	6x_21+6x_22+6x_23	7x_21+7x_22+7x_23	8x_21+8x_22+8x_23

x.grad

Variable containing:
 10  10  10
 26  26  26

10 = 1+2+3+4+0+0+0+0
26=0+0+0+0+5+6+7+8

I find that, gradient will perform a dot product on output in the above case.
What is the objective of this operation?

jpeg729 · January 24, 2018, 8:23am

Might I suggest a lecture on backpropagation by Andrej Karpathy

richard · January 24, 2018, 3:15pm

What do you mean? I don’t think that’s what it does.

Assume that you have some input and a function f such that f(input) = output

The gradient you pass to output.backward() is the “gradient of the output”. What happens with backpropagation is that we use the gradient of the output to compute the gradient of the input.

xu-song · February 1, 2018, 2:37pm

sorry, I don’t quite understand the gradient of output?
the gradient of output w.r.t. model parameters = d(out)/d(param) = lim(Δout)/(Δparam).

the gradient of output = d(out)? or Δout ? or backward propagation errors?
@richard @jpeg729 can u explain it mathmatically?

richard · February 2, 2018, 5:42pm

Let’s say your output is not a tensor with size 1. In addition, assume that there’s some function f that you’re acting on the output, so that you get something like newOutput = f(output).

You can pass a gradient grad to output.backward(grad). The idea of this is that if you’re doing backpropagation manually, and you know the gradient of the input of the next layer (f in this case), then you can pass the gradient of the input of the next layer to the previous layer that had output output.

wuuw · July 6, 2019, 3:32am

Thanks!!! You really helped me!

singhvishal0209 · March 10, 2020, 6:23pm

suppose I have a image A of size 720*720 and I want to calculate its gradient.

Hence I will use A.backward() but since A is not scalar I need to give gradient as an argument to backward. I understand that the shape of gradient should also be 720*720 but what should be the value of it?

Should I randomly initialize a tensor of shape 720*720

humble_D · March 18, 2020, 10:21am

You don’t calculate gradients of a tensor, you calculate gradients of an operation done on that tensor.

Now coming back to your question, now if you actually came up with an operation that returned a 720^2 tensor then yes, you would want to pass in a gradient tensor equivalent to that shape. The reason why you pass this tensor is explained in Andrej Karpathy’s lecture. Go take a look.

'gradient' argument in out.backward(gradient)

Q1: what is the purpose of gradient argument?

Q2: Why gradient argument is required for tensor outputs?

what does gradient argument do in backward()?

Q1: what is the purpose of `gradient` argument?

Q2: Why `gradient` argument is required for tensor outputs?

what does `gradient` argument do in backward()?