Hi, I just read through this clear and straightforward tutorial on defining custom Autograd functions in PyTorch:
https://pytorch.org/tutorials/beginner/examples_autograd/two_layer_net_custom_function.html
The code is very clear, but one thing I'd like to clarify is the comment in the code that says "In the backward pass we receive a Tensor containing the gradient of the loss with respect to the output, and we need to compute the gradient of the loss with respect to the input."
Shouldn't it be "we need to compute the gradient of the output with respect to the input"?
I'm confused about how we could define the gradient of the loss with respect to the input, since typically we think of the gradient of a function as the gradient of that function's output with respect to its input.
Forgive me if I misunderstood, but I'm also somewhat confused about what exactly is referred to by "gradient of the loss with respect to the output". In the formulation of backprop that I'm familiar with, the function's gradient is simply multiplied by the recursive product of downstream gradients (to put it into very loose English). I could give a very short and hopefully simple formal summary of that formulation if needed; it would definitely help me to understand how PyTorch's autograd fits into my existing understanding.
Thanks so much for any further clarifications!
Your loss L is a function of y (target) and y_hat (output). The first step in backpropagation is dL/dy_hat (gradient of the loss with respect to the output), and that's what you are reading in the tutorial.
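To make that first step concrete, here is a minimal dependency-free sketch (plain Python lists, not the tutorial's tensors) assuming a mean-squared-error loss L = mean((y_hat - y)^2), for which dL/dy_hat = 2 * (y_hat - y) / n elementwise:

```python
# Sketch: gradient of MSE loss with respect to the model output y_hat.
# This is the "grad_output" that autograd would hand to the last layer's backward.

def mse_grad_wrt_output(y_hat, y):
    # dL/dy_hat_i = 2 * (y_hat_i - y_i) / n  for L = (1/n) * sum((y_hat - y)**2)
    n = len(y_hat)
    return [2.0 * (yh - yt) / n for yh, yt in zip(y_hat, y)]

y_hat = [0.5, 1.5, 2.0]  # model outputs
y     = [1.0, 1.0, 2.0]  # targets
grads = mse_grad_wrt_output(y_hat, y)
print(grads)
```

The choice of MSE here is an illustrative assumption; the tutorial's point holds for any differentiable loss.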
In the general case, for a function not belonging to the output layer, would the correct interpretation be "we need to compute the gradient of the output with respect to the input"?
Sure, thatās a reasonable interpretation.
Let me clarify. The ReLU function is f(x) = max(0, x). It means if x <= 0 then f(x) = 0, else f(x) = x. In the first case, when x < 0, the derivative of f(x) with respect to x is f'(x) = 0. So, we perform grad_input[input < 0] = 0. In other words, we "compute the gradient of the output with respect to the input" and not "compute the gradient of the loss with respect to the input".
Oh okay, sorry, didn't see your reply before posting. So my understanding in that regard is correct?
Not in the original post, but yes in the next post.
Not sure which posts you're referring to; I don't think my understanding has changed. But I'm going to assume we're on the same page about the meaning of the second half of the comment. As for the first half of the comment, "gradient of the loss with respect to the output", is there a general-case interpretation for that as well? I need to use a custom Autograd function that belongs to arbitrary hidden layers, not just output layers. Would it help if I gave some quick backprop formulas so that we could synchronize our language in a formal way?
Basically, I'm trying to clarify how the variable grad_output in that tutorial is computed. Is there an explicit formula written anywhere? If not, I can quickly derive backprop in simple terms and maybe we can identify where grad_output fits into that derivation. It really wouldn't be as complicated as it sounds. I just need to figure out whether I can use this for my research.
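For what it's worth, grad_output for a node is produced recursively by the chain rule: it is dL with respect to that node's output, accumulated from everything downstream of it. A scalar sketch with a hypothetical two-function chain (f and g are illustrative stand-ins, not anything from the tutorial):

```python
# Chain: x -> f -> u -> g -> L, with f(x) = x**2 and g(u) = 3*u (so L = g(f(x))).
# Backprop starts at L with gradient 1.0; each backward multiplies the
# incoming grad_output by its own local derivative and passes the result on.

def f_forward(x):
    return x ** 2

def f_backward(x, grad_output):
    # local derivative du/dx = 2*x; returns dL/dx = dL/du * du/dx
    return grad_output * 2 * x

def g_forward(u):
    return 3 * u

def g_backward(grad_output):
    # local derivative dL/du = 3; returns dL/du = dL/dL * 3
    return grad_output * 3

x = 5.0
u = f_forward(x)              # 25.0
L = g_forward(u)              # 75.0
dL_du = g_backward(1.0)       # this is the grad_output that f's backward receives
dL_dx = f_backward(x, dL_du)  # 3 * 2 * 5 = 30.0
print(dL_du, dL_dx)
```

So there is no single explicit formula for grad_output in general: it is whatever the product of downstream local derivatives works out to for the graph at hand, delivered to your custom function's backward() as its argument.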
@Sam_Lerman I can assure you my knowledge of Autograd is atrocious. However, just in case, did you take a look at this post: Exact meaning of grad_input and grad_output?
On a side note, the two posts I was referring to were on either side of my first reply: your original post and your subsequent reply to mine. In my minimal understanding of backpropagation and automatic differentiation, I would not necessarily think the docstring you were referring to is either wrong or misleading. It's best left to the individual to interpret it in a way that could aid them in generalizing their intuition (which could be wrong too).