For a torch.autograd Variable, there’s the ‘gradient’ input param for the .backward() function. I don’t quite understand what this ‘gradient’ input param stands for, or why it is needed.

To call `.backward()` you need the gradient w.r.t. the output. It is needed as part of the chain rule / backpropagation algorithm.

Note that if you’re calling it on a scalar loss/cost variable, you don’t need to provide an argument, since autograd will assume a tensor of ones.

A more precise mathematical definition would be to say that what we often refer to as “gradients” are Jacobian-vector products, and the `gradient` argument is the vector you want to multiply the Jacobian with (for the loss it’s initialized as [1] by default, so it gives the Jacobian itself).
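To make the Jacobian-vector description concrete, here is a minimal sketch. The matrix `A`, the input `x`, and the vector `v` are made-up values chosen only for illustration. For `y = A @ x` the Jacobian of `y` with respect to `x` is `A`, so calling `y.backward(v)` should leave `v @ A` in `x.grad` (in reverse mode the vector multiplies the Jacobian from the left).

```python
import torch

# Hypothetical values, chosen only for illustration.
A = torch.tensor([[1.0, 2.0, 3.0],
                  [4.0, 5.0, 6.0]])
x = torch.tensor([1.0, 1.0, 1.0], requires_grad=True)

y = A @ x                      # non-scalar output of shape (2,); dy/dx is A
v = torch.tensor([1.0, 0.5])   # the `gradient` argument, same shape as y

y.backward(gradient=v)         # accumulates the product of v with the Jacobian into x.grad

expected = v @ A               # the same product computed by hand
print(x.grad)
print(torch.allclose(x.grad, expected))
```

Here `x.grad` holds one number per element of `x`, not the full 2×3 Jacobian. Autograd never materializes the Jacobian; it only needs the product with `v`.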

Hi,

I implemented a skip-gram model using PyTorch. However, I find that the backward speed depends on the embedding size: the larger the embedding size, the slower the backward pass. So I’m wondering whether backward calculates the gradient of every variable, even variables that are not used in this step?

For example, I have an Embedding variable of size 10,000*100. When I use only the 10th and 20th embeddings, will the backward function calculate the gradient only for these two embeddings, or will it calculate the gradient for all 10,000 embeddings?

I also have trouble with this. It says that the argument of backward should be the gradients w.r.t. the output. But consistently across the documentation it is never mentioned what the gradient w.r.t. the output is the gradient of (must be a conspiracy to drive me crazy). Clearly you can define the gradient as an operator, but then you don’t get numbers (like 1). Can the thing of which you take the gradient w.r.t. the output not be stated concisely? Does it depend on the use case? If I just have y=f(x), for example, how do I derive the 1 that is assumed?

Is it the gradient of the eventual downstream loss with respect to the current layer? So that in the case of a scalar loss which is also the “most downstream output/loss” we get dloss/dloss =1 but if we want to get backward() from some middle layer we have to provide the gradient of the downstream loss w.r.t. all the outputs of this middle layer (evaluated at the current values of those outputs) in order to get well defined numerical results. This makes sense to me and actually occurs in backprop.

In more technical terms: let y be an arbitrary node in a computational graph. If we call y.backward(arg), the argument arg to backward should be the gradient of the root of the computational graph with respect to y, evaluated at a specific value of y (usually the current value of y). If y is a whole layer, this means that arg should provide a value for each neuron in y. If y is the final loss, it is also the root of the graph, and we get the usual scalar one as the only reasonable argument arg.
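This reading can be checked numerically. The sketch below uses a made-up two-stage computation, `y = x ** 2` as the "middle layer" and `loss = (y ** 2).sum()` as the root. It compares backpropagating from the root with calling `y.backward(arg)`, where `arg` is the gradient of the loss with respect to `y`, written out by hand and evaluated at the current value of `y`.

```python
import torch

# Hypothetical input, chosen only for illustration.
x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)

# Path 1: start backward from the scalar root of the graph.
y = x ** 2                     # "middle layer"
loss = (y ** 2).sum()          # scalar root
loss.backward()                # implicit argument: dloss/dloss = 1
grad_from_root = x.grad.clone()

# Path 2: start backward from the middle node y instead.
x.grad = None                  # reset the accumulated gradient
y = x ** 2                     # rebuild the graph up to y
dloss_dy = 2 * y.detach()      # d/dy sum(y^2) = 2y, evaluated at the current y
y.backward(dloss_dy)           # arg has one entry per element of y
grad_from_middle = x.grad.clone()

print(grad_from_root)
print(grad_from_middle)
print(torch.allclose(grad_from_root, grad_from_middle))
```

Both paths should give the same `x.grad`, which is what the chain rule predicts: the argument to `y.backward` stands in for everything downstream of `y`.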

Am I getting there?

Yes, that’s correct. We only support differentiation of scalar functions, so if you want to start backward from a non-scalar value you need to provide `dout / dy`.
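As a small sketch of that restriction, with a made-up tensor `x`: calling `backward()` with no argument on a non-scalar tensor is expected to raise a `RuntimeError` in current PyTorch releases, while passing an explicit gradient of the same shape works. Using a tensor of ones here is just one possible choice, and it amounts to differentiating `y.sum()`.

```python
import torch

# Hypothetical input, chosen only for illustration.
x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)

# Without an argument, backward on a non-scalar tensor is rejected.
y = 2 * x
try:
    y.backward()
    raised = False
except RuntimeError as err:
    raised = True
    print("backward() without an argument failed:", err)

# With an explicit gradient of the same shape as y, it works.
y = 2 * x                            # rebuild the graph to be safe
y.backward(torch.ones_like(y))       # same result as y.sum().backward()
print(x.grad)
```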