The shape of the params has absolutely no relation to the required shape of the argument to out.backward. params is a list containing the parameter tensors of the various modules in your net. In this case the net has two Conv2d layers with 2 parameter tensors each (.weight and .bias) and three Linear layers with 2 parameter tensors each (.weight and .bias). That makes 10 tensors in total.
The last layer of net has 10 outputs, which is why out.backward needs a gradient argument with 10 values, one per output.
EDIT: I originally wrote that calling out.backward() calculates the gradient of the output. That was wrong: out.backward() is equivalent to out.backward(torch.Tensor([1])), so the argument-free form only makes sense when out holds a single value.
Usually we want the gradient of the loss, e.g.
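To check the count of 10, here is a minimal sketch. The layer sizes are an assumption (a LeNet-style net like the one in the PyTorch tutorial), since the thread does not show the actual definition of net; only the number and kind of layers comes from the post above.

```python
import torch.nn as nn


class Net(nn.Module):
    """Hypothetical net: two Conv2d layers and three Linear layers."""

    def __init__(self):
        super().__init__()
        # Layer sizes below are assumed for illustration only.
        self.conv1 = nn.Conv2d(1, 6, 5)
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.fc1 = nn.Linear(16 * 5 * 5, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)


net = Net()
params = list(net.parameters())

# 5 layers x (weight, bias) = 10 parameter tensors.
print(len(params))

# Each tensor has its own shape, unrelated to the shape of the net's output.
for name, p in net.named_parameters():
    print(name, tuple(p.shape))
```

The number of entries in params depends only on how many layers the net has, while the argument to out.backward depends only on the shape of out.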
out = net(input)
loss = torch.nn.functional.mse_loss(out, target)
loss.backward()
Each time you run .backward(), the stored gradients for each parameter are updated by adding the new gradients. This allows us to accumulate gradients over several samples or several batches before using them to update the weights. Once you have updated the weights, you don’t want to keep those gradients around, because if you reuse them you will push the weights too far.
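A small sketch of that accumulation, using a single made-up parameter tensor w rather than a full net (the names w and x are just for illustration):

```python
import torch

w = torch.ones(3, requires_grad=True)
x = torch.tensor([1.0, 2.0, 3.0])

# First backward pass: d(sum(w * x))/dw = x
(w * x).sum().backward()
first = w.grad.clone()

# Second backward pass without zeroing: the new gradient is ADDED to w.grad
(w * x).sum().backward()
second = w.grad.clone()

print(first)   # x
print(second)  # 2 * x

# Reset before the next update step so old gradients are not reused.
w.grad.zero_()
print(w.grad)  # all zeros
```

With an optimizer you would normally call optimizer.zero_grad() (or net.zero_grad()) instead of zeroing each .grad by hand.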
Concerning out.backward(), I was mistaken, you are right. It is equivalent to doing out.backward(torch.Tensor([1]))
The params are all declared using Variable(.., requires_grad=True) or something equivalent. This means that whenever you use those params in a calculation, PyTorch assumes you are going to want the gradient with respect to those params, and it saves all the details it will need in order to calculate those gradients.
The value out was calculated using the params listed in net.parameters(), so whenever you call out.backward the gradient will be calculated with respect to every one of those parameters.
What do you mean? I don’t think that’s what it does.
Assume that you have some input and a function f such that f(input) = output
The gradient you pass to output.backward() is the “gradient of the output”. What happens with backpropagation is that we use the gradient of the output to compute the gradient of the input.
Let’s say your output is not a tensor with size 1. In addition, assume that there’s some function f that you’re applying to the output, so that you get something like newOutput = f(output).
You can pass a gradient grad to output.backward(grad). The idea of this is that if you’re doing backpropagation manually, and you know the gradient of the input of the next layer (f in this case), then you can pass the gradient of the input of the next layer to the previous layer that had output output.
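As a rough sketch of this, assume a made-up previous layer output = x ** 2 and a made-up next layer f(output) = sum(3 * output), whose gradient with respect to output is 3 everywhere. Passing that gradient to output.backward by hand should give the same x.grad as letting autograd go through f itself:

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
output = x ** 2  # non-scalar output of the "previous layer"

# Gradient of newOutput = sum(3 * output) with respect to output,
# worked out by hand: it is 3 for every element.
grad = torch.full_like(output, 3.0)

# Manual backpropagation: feed the next layer's input gradient into output.
output.backward(grad)
manual = x.grad.clone()

# Reference: let autograd differentiate through f as well.
x2 = x.detach().clone().requires_grad_(True)
new_output = (3 * x2 ** 2).sum()
new_output.backward()

print(manual)   # 6 * x
print(x2.grad)  # 6 * x
```

Note that grad has the same shape as output, not the shape of x or of any parameter.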
Suppose I have an image A of size 720*720 and I want to calculate its gradient.
Hence I will use A.backward(), but since A is not a scalar I need to give a gradient as an argument to backward. I understand that the shape of the gradient should also be 720*720, but what should its value be?
Should I randomly initialize a tensor of shape 720*720?
You don’t calculate gradients of a tensor, you calculate gradients of an operation done on that tensor.
Coming back to your question: if you actually came up with an operation that returned a 720*720 tensor, then yes, you would want to pass in a gradient tensor of that same shape. The reason why you pass this tensor is explained in Andrej Karpathy’s lecture. Go take a look.
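For the 720*720 case, a minimal sketch could look like this. The operation B = 2 * A + 1 is invented purely for illustration, and passing a tensor of ones is one common choice (it amounts to differentiating B.sum()), not the only valid one:

```python
import torch

# A is the leaf tensor we want gradients for.
A = torch.rand(720, 720, requires_grad=True)

# Hypothetical operation on A that returns another 720x720 tensor.
B = 2 * A + 1

# B is not a scalar, so backward needs a gradient with the same shape as B.
# Ones weight every element of B equally, like calling B.sum().backward().
B.backward(torch.ones_like(B))

print(A.grad.shape)  # same shape as A
print(A.grad[0, 0])  # dB/dA = 2 for every element
```

A random tensor would also run, but then A.grad would be the gradient of a randomly weighted sum of B's elements, which is rarely what you want.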