'gradient' argument in out.backward(gradient)

Q1: what is the purpose of gradient argument?

gradient will perform a dot product on out, right?

what is the objective of this operation?

Q2: Why gradient argument is required for tensor outputs?

Can we use the following backward, which not specify the gradient argument?

out.backward()

It will cause the following error

“grad can be implicitly created only for scalar outputs”

Why is the gradient argument required in tensor outputs? We can set a tensor of ones as the default value of gradient.

2 Likes
  1. The shape of the params has absolutely no relation to the required shape of the argument to out.backward.
    params is a list containing the weight tensors of the various modules in your net, so in this case the net has two Conv2d layers which have 2 tensors of parameters each (.weight and .bias), and three Linear layers which each have 2 tensors of parameters (.weight and .bias). That makes 10 tensors in total.

The last layer of net has 10 outputs which is why out.backward needs 10 inputs. One per output.

  1. Calling out.backward() will --calculate-the-gradient-of-the-output–. EDIT: out.backward() is equivalent to out.backward(torch.Tensor([1]))
    Usually we need the gradient of the loss. e.g.
out = net(input)
loss = torch.nn.functional.mse_loss(out, target)
loss.backward()
  1. Each time you run .backward() the stored gradients for each parameter are updated by adding the new gradients. This allows us to cumulate gradients over several samples or several batches before using the gradients to update the weights. Once you have updated the weights you don’t want to keep those gradients around because if you reuse them, then you will push the weights too far.
3 Likes

Yes, The shape of the argument should same with the shape of out

Concerning out.backward(), I was mistaken, you are right. It is equivalent to doing out.backward(torch.Tensor([1]))

The params are all declared using Variable(.., requires_grad=True) or something equivalent. This means that whenever you use those params in a calculation pyTorch assumes you are going to want to calculate the gradient with respect to those params and it saves all the details it is going to need in order to calculate those gradients.

The value out was calculated using the params listed in net.parameters(), so whenever you call out.backward the gradient will be calculated with respect to every one of those parameters.

For your first question, the output of your network out = net(input) has size (1, 10), so you want the gradient you pass in to match that size.

what does gradient argument do in backward()?

I make a test by the following case.

import torch
from torch.autograd import Variable

x = Variable(torch.ones(2, 3), requires_grad=True)
a = Variable(torch.ones(3, 4))
out = torch.mm(x,a)    # shape: (2,4)

gradient = torch.FloatTensor([[1,2,3,4],[5,6,7,8]]) # shape=(2,4)
#gradient = torch.ones(4, 4)
out.backward(gradient)
print(x.grad)

x

x_11 x_12 x_13
x_21 x_22 x_23

out

x_11+x_12+x_13 x_11+x_12+x_13 x_11+x_12+x_13 x_11+x_12+x_13
x_21+x_22+x_23 x_21+x_22+x_23 x_21+x_22+x_23 x_21+x_22+x_23

gradient*out

x_11+x_12+x_13 2x_11+2x_12+2x_13 3x_11+3x_12+3x_13 4x_11+4x_12+4x_13
5x_21+5x_22+5x_23 6x_21+6x_22+6x_23 7x_21+7x_22+7x_23 8x_21+8x_22+8x_23

x.grad

Variable containing:
 10  10  10
 26  26  26

10 = 1+2+3+4+0+0+0+0
26=0+0+0+0+5+6+7+8


I find that, gradient will perform a dot product on output in the above case.
What is the objective of this operation?

Might I suggest a lecture on backpropagation by Andrej Karpathy

1 Like

What do you mean? I don’t think that’s what it does.

Assume that you have some input and a function f such that f(input) = output

The gradient you pass to output.backward() is the “gradient of the output”. What happens with backpropagation is that we use the gradient of the output to compute the gradient of the input.

2 Likes

sorry, I don’t quite understand the gradient of output?
the gradient of output w.r.t. model parameters = d(out)/d(param) = lim(Δout)/(Δparam).

the gradient of output = d(out)? or Δout ? or backward propagation errors?
@richard @jpeg729 can u explain it mathmatically?

1 Like

Let’s say your output is not a tensor with size 1. In addition, assume that there’s some function f that you’re acting on the output, so that you get something like newOutput = f(output).

You can pass a gradient grad to output.backward(grad). The idea of this is that if you’re doing backpropagation manually, and you know the gradient of the input of the next layer (f in this case), then you can pass the gradient of the input of the next layer to the previous layer that had output output.

10 Likes

Thanks!!! You really helped me!

suppose I have a image A of size 720*720 and I want to calculate its gradient.

Hence I will use A.backward() but since A is not scalar I need to give gradient as an argument to backward. I understand that the shape of gradient should also be 720*720 but what should be the value of it?

Should I randomly initialize a tensor of shape 720*720

1 Like

You don’t calculate gradients of a tensor, you calculate gradients of an operation done on that tensor.

Now coming back to your question, now if you actually came up with an operation that returned a 720^2 tensor then yes, you would want to pass in a gradient tensor equivalent to that shape. The reason why you pass this tensor is explained in Andrej Karpathy’s lecture. Go take a look.