Defining Autograd functions tutorial - typo?

Hi, I just read through this clear and straightforward tutorial on defining custom Autograd functions in PyTorch:
https://pytorch.org/tutorials/beginner/examples_autograd/two_layer_net_custom_function.html

The code is very clear, but one thing I'd like to clarify is this comment in the code: "In the backward pass we receive a Tensor containing the gradient of the loss with respect to the output, and we need to compute the gradient of the loss with respect to the input."

Shouldn't it be "we need to compute the gradient of the output with respect to the input"?

I'm confused about how we could define the gradient of the loss with respect to the input, since typically we think of the gradient of a function as the gradient of that function's output with respect to its input.

Forgive me if I misunderstood, but I'm also somewhat confused about what exactly is referred to by "gradient of the loss with respect to the output". In the formulation of backprop that I'm familiar with, the function's local gradient is simply multiplied by the recursively accumulated product of downstream gradients (to put it into very loose English). I could give a very short and hopefully simple formal summary of that formulation if needed; it would definitely help me to understand how PyTorch's autograd fits into my existing understanding.
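For reference, the loose formulation I have in mind is just the chain rule applied recursively: for a layer $y = f(x)$ inside a network with scalar loss $L$,

$$
\frac{\partial L}{\partial x} = \left(\frac{\partial y}{\partial x}\right)^{\top} \frac{\partial L}{\partial y},
$$

so each layer only needs its own local Jacobian $\partial y / \partial x$, and the factor $\partial L / \partial y$ is handed down from the layers between it and the loss.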

Thanks so much for any further clarifications!

Your loss L is a function of y (target) and y_hat (output). The first step in backpropagation is computing dL/dy_hat (the gradient of the loss with respect to the output), and that's what you are reading in the tutorial.
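To make that concrete, here is a minimal sketch (an MSE loss, chosen just for illustration); after backward(), y_hat.grad is exactly this dL/dy_hat:

```python
import torch

# y_hat is the network output, y is the target (toy values).
y_hat = torch.tensor([0.5, -1.0, 2.0], requires_grad=True)
y = torch.tensor([1.0, 0.0, 2.0])

# Scalar loss: mean squared error.
loss = ((y_hat - y) ** 2).mean()
loss.backward()

# Analytic gradient of the loss w.r.t. the output: dL/dy_hat = 2 * (y_hat - y) / N.
manual = (2 * (y_hat - y) / y_hat.numel()).detach()
print(torch.allclose(y_hat.grad, manual))  # True
```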


In the general case, for a function not belonging to the output layer, would the correct interpretation be "we need to compute the gradient of the output with respect to the input"?

Sure, that's a reasonable interpretation.

Let me clarify. The ReLU function is f(x) = max(0, x), i.e. if x <= 0 then f(x) = 0, else f(x) = x. In the first case, when x < 0, the derivative of f(x) with respect to x is f'(x) = 0. So, we perform grad_input[input < 0] = 0. In other words, we "compute the gradient of the output with respect to the input" and not "compute the gradient of the loss with respect to the input".
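For reference, the custom ReLU from that tutorial looks roughly like this (paraphrased, so check the linked page for the exact version):

```python
import torch

class MyReLU(torch.autograd.Function):
    @staticmethod
    def forward(ctx, input):
        # Save the input so backward() can rebuild the mask.
        ctx.save_for_backward(input)
        return input.clamp(min=0)

    @staticmethod
    def backward(ctx, grad_output):
        # grad_output carries dL/d(output) from downstream; the masking
        # applies the local derivative d(output)/d(input), which is
        # 1 where input > 0 and 0 where input < 0.
        input, = ctx.saved_tensors
        grad_input = grad_output.clone()
        grad_input[input < 0] = 0
        return grad_input
```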

Oh okay, sorry, I didn't see your reply before posting. So my understanding in that regard is correct?

Not in the original post, but yes in the next post :wink:

Not sure which posts you're referring to; I don't think my understanding has changed. But I'll assume we're on the same page about the meaning of the second half of the comment :slight_smile: As for the first half, "gradient of the loss with respect to the output": is there a general-case interpretation for that as well? I need to use a custom Autograd function in arbitrary hidden layers, not just output layers. Would it help if I gave some quick backprop formulas so that we could synchronize our language in a formal way?

Basically, I'm trying to clarify how the variable grad_output in that tutorial is computed. Is there an explicit formula written anywhere? If not, I can quickly derive backprop in simple terms and maybe we can identify where grad_output fits into that derivation. It really wouldn't be as complicated as it sounds. I just need to figure out whether I can use this for my research.
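To make it concrete, here is my current mental model in code (the numbers and the toy downstream op are just for illustration): for any intermediate tensor, grad_output is simply dL/d(that tensor), accumulated by the chain rule from everything downstream of it.

```python
import torch

x = torch.tensor([-1.0, 2.0, 3.0], requires_grad=True)
y = torch.relu(x)   # pretend this is our custom hidden-layer function
y.retain_grad()     # keep dL/dy so we can inspect it after backward()

loss = (3 * y).sum()  # everything downstream of y
loss.backward()

# y.grad is exactly the grad_output that a custom Function producing y
# would receive in its backward(): dL/dy, built up by the chain rule.
print(y.grad)  # tensor([3., 3., 3.])
print(x.grad)  # tensor([0., 3., 3.])  (grad_output times ReLU's local derivative)
```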

@Sam_Lerman I can assure you my knowledge of Autograd is atrocious :grin:. However, just in case, did you take a look at this post :point_right: Exact meaning of grad_input and grad_output?

On a side note, the two posts I was referring to were the ones on either side of my first reply: your original post and your follow-up to my reply. In my minimal understanding of backpropagation and automatic differentiation, I would not necessarily say the docstring you were referring to is either wrong or misleading. It's best left to the individual to interpret it in a way that could aid them in generalizing their intuition (which could be wrong too :wink: ).