In another thread it was pointed out that `grad_input` holds the gradients w.r.t. the inputs of the last operation of the layer, while `grad_output` holds the gradient w.r.t. the output of the layer. So which of these is used in the next step of the chain rule, i.e. to compute the gradient of the layer preceding this one?
Does `grad_input` contain gradients w.r.t. the parameters only, and is therefore not used in further computations, or is it passed on to the next computation? Or is `grad_output` what gets used in the next computation?
Is there a tutorial that explains how `grad_input` and `grad_output` are used in the computation (specific to PyTorch, not the chain rule in general)?