Hi, I just read through this clear and straightforward tutorial on defining custom Autograd functions in PyTorch:

https://pytorch.org/tutorials/beginner/examples_autograd/two_layer_net_custom_function.html

The code is very clear, but one thing I’d like clarification on is the comment in the code that says “In the backward pass we receive a Tensor containing the gradient of the loss with respect to the output, and we need to compute the gradient of the loss with respect to the input.”

Shouldn’t it be “we need to compute the gradient of the *output* with respect to the input”?

I’m confused about how we could define the gradient of the loss with respect to the input, since we typically think of the gradient of a function as the gradient of that function’s output with respect to its input.

Forgive me if I misunderstood, but I’m also somewhat confused about what exactly is referred to by “gradient of the loss with respect to the output”. In the formulation of backprop that I’m familiar with, the function’s gradient is simply multiplied by the recursive sum of forward gradients (to put it into very loose English). I could give a very short and hopefully simple formal summary of that formulation if needed; it would definitely help me understand how PyTorch’s autograd fits into my existing understanding.

Thanks so much for any further clarifications!

Your loss `L` is a function of `y` (target) and `y_hat` (output). The first step in backpropagation is `dL/dy_hat` (the gradient of the loss with respect to the output), and that’s what you are reading in the tutorial.
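To make that concrete, here is a minimal sketch (my own example, not from the tutorial) using a squared-error loss, where the analytic gradient of the loss with respect to the output is `dL/dy_hat = 2 * (y_hat - y)`:

```python
import torch

# Target and model output (requires_grad so autograd tracks y_hat).
y = torch.tensor([1.0, 2.0, 3.0])
y_hat = torch.tensor([1.5, 1.0, 3.0], requires_grad=True)

# Squared-error loss L = sum((y_hat - y)^2).
L = ((y_hat - y) ** 2).sum()
L.backward()

# y_hat.grad now holds dL/dy_hat, the "gradient of the loss with
# respect to the output" that backpropagation starts from.
print(y_hat.grad)                      # tensor([ 1., -2.,  0.])
print((2 * (y_hat - y)).detach())      # matches the analytic formula
```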


In the general case, for a function not belonging to the output layer, would the correct interpretation be “we need to compute the gradient of the *output* with respect to the input”?

Sure, that’s a reasonable interpretation.

Let me clarify. The ReLU function is `f(x)=max(0,x)`. That means if `x<=0` then `f(x)=0`, else `f(x)=x`. In the first case, when `x<0`, the derivative of `f(x)` with respect to `x` is `f'(x)=0`. So we perform `grad_input[input < 0] = 0`. In other words, we “compute the gradient of the *output* with respect to the input” and not “compute the gradient of the loss with respect to the input”.
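The step above can be sketched as a custom `torch.autograd.Function` (a paraphrase of the tutorial’s ReLU, not verbatim): the backward receives `grad_output` and zeroes it out wherever the local ReLU derivative is zero.

```python
import torch

class MyReLU(torch.autograd.Function):
    """Custom ReLU, paraphrasing the tutorial's example."""

    @staticmethod
    def forward(ctx, input):
        ctx.save_for_backward(input)
        return input.clamp(min=0)

    @staticmethod
    def backward(ctx, grad_output):
        input, = ctx.saved_tensors
        grad_input = grad_output.clone()
        # Local derivative of ReLU is 0 where input < 0, 1 elsewhere;
        # multiplying it into grad_output amounts to zeroing those entries.
        grad_input[input < 0] = 0
        return grad_input

x = torch.tensor([-1.0, 2.0], requires_grad=True)
y = MyReLU.apply(x)
y.sum().backward()
print(x.grad)   # tensor([0., 1.])
```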

Oh okay, sorry, I didn’t see your reply before posting. So my understanding in that regard is correct?

Not in the original post, but yes in the next post.

Not sure which posts you’re referring to; I don’t think my understanding has changed. But I’m going to assume we’re on the same page about the meaning of the second half of the comment. As for the first half of the comment, “gradient of the loss with respect to the output”, is there a general-case interpretation for that as well? I need to use a custom Autograd function that belongs to arbitrary hidden layers, not just output layers. Would it help if I gave some quick backprop formulas so that we could synchronize our language in a formal way?

Basically, I’m trying to clarify how the variable `grad_output` in that tutorial is computed. Is there an explicit formula written anywhere? If not, I can quickly derive backprop in simple terms and maybe we can identify where `grad_output` fits into that derivation. It really wouldn’t be as complicated as it sounds. I just need to figure out whether I can use this for my research.
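One way to see how `grad_output` is computed, even for a hidden-layer op, is to capture it inside a custom Function and compare it with the hand-derived chain rule. The sketch below (my own toy example, not from the tutorial) uses a hypothetical `Square` op: autograd hands its backward `grad_output = dL/dy`, and the backward returns `dL/dx = dL/dy * dy/dx`.

```python
import torch

captured = {}  # stash grad_output so we can inspect it after backward()

class Square(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return x * x

    @staticmethod
    def backward(ctx, grad_output):
        x, = ctx.saved_tensors
        captured['grad_output'] = grad_output.clone()  # dL/dy for y = x^2
        return grad_output * 2 * x                     # chain rule: dL/dx = dL/dy * 2x

x = torch.tensor([3.0], requires_grad=True)
y = Square.apply(x)   # "hidden layer" output: y = x^2 = 9
L = 5 * y             # downstream "loss": L = 5*y, so dL/dy = 5
L.backward()

print(captured['grad_output'])  # tensor([5.])  -> dL/dy, supplied by autograd
print(x.grad)                   # tensor([30.]) -> dL/dx = 5 * 2 * 3
```

So `grad_output` is the accumulated gradient of the loss with respect to this op’s output, built up by autograd from everything downstream of it.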

@Sam_Lerman I can assure you my knowledge of Autograd is atrocious. However, just in case, did you take a look at this post: Exact meaning of grad_input and grad_output?

On a side note, the two posts I was referring to were on either side of my first reply: your original post and your subsequent reply to mine. In my minimal understanding of backpropagation and automatic differentiation, I would not necessarily think the docstring you were referring to is either wrong or misleading. It’s best left to the individual to interpret it in a way that could aid them in generalizing their intuition (which could be wrong too).