There is something I don’t quite understand in this tutorial:

https://pytorch.org/docs/stable/notes/extending.html

There is an example that implements a simple linear function in PyTorch:

```
from torch.autograd import Function


# Inherit from Function
class LinearFunction(Function):

    # Note that both forward and backward are @staticmethods
    @staticmethod
    # bias is an optional argument
    def forward(ctx, input, weight, bias=None):
        ctx.save_for_backward(input, weight, bias)
        output = input.mm(weight.t())
        if bias is not None:
            output += bias.unsqueeze(0).expand_as(output)
        return output

    # This function has only a single output, so it gets only one gradient
    @staticmethod
    def backward(ctx, grad_output):
        # This is a pattern that is very convenient - at the top of backward
        # unpack saved_tensors and initialize all gradients w.r.t. inputs to
        # None. Thanks to the fact that additional trailing Nones are
        # ignored, the return statement is simple even when the function has
        # optional inputs.
        input, weight, bias = ctx.saved_tensors
        grad_input = grad_weight = grad_bias = None

        # These needs_input_grad checks are optional and there only to
        # improve efficiency. If you want to make your code simpler, you can
        # skip them. Returning gradients for inputs that don't require it is
        # not an error.
        if ctx.needs_input_grad[0]:
            grad_input = grad_output.mm(weight)
        if ctx.needs_input_grad[1]:
            grad_weight = grad_output.t().mm(input)
        if bias is not None and ctx.needs_input_grad[2]:
            grad_bias = grad_output.sum(0)

        return grad_input, grad_weight, grad_bias
```

The question is probably simplistic, but if `backward` is simply the “gradient formula”, I don’t understand why each gradient is multiplied by `grad_output`.
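To make concrete what I mean by "multiplied by `grad_output`", here is a minimal sketch. It makes two assumptions that are not in the tutorial: it uses the built-in `torch.nn.functional.linear` instead of the `LinearFunction` class, and `g` is just a random tensor standing in for `grad_output`. The shapes and seed are arbitrary.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Arbitrary shapes: batch of 4, 3 input features, 2 output features.
x = torch.randn(4, 3, dtype=torch.double, requires_grad=True)
W = torch.randn(2, 3, dtype=torch.double, requires_grad=True)
b = torch.randn(2, dtype=torch.double, requires_grad=True)

out = F.linear(x, W, b)    # computes x @ W.t() + b, shape (4, 2)
g = torch.randn_like(out)  # hypothetical stand-in for grad_output

# Ask autograd for the gradients, feeding g in as the incoming gradient.
gx, gW, gb = torch.autograd.grad(out, (x, W, b), grad_outputs=g)

# Compare against the three formulas used in the tutorial's backward.
match_input = torch.allclose(gx, g.mm(W.detach()))
match_weight = torch.allclose(gW, g.t().mm(x.detach()))
match_bias = torch.allclose(gb, g.sum(0))
print(match_input, match_weight, match_bias)
```

In this sketch autograd's results agree with the tutorial's expressions, so `g` really does appear as a factor in every returned gradient.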

In my mind I have the following, writing x = input, w = weight, and b = bias for clarity:

f(x) = w * x + b

df/dx = w

df/dw = x

df/db = 1
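As a sanity check on these derivatives, here is a minimal sketch with scalar tensors (the values 3, 2 and 1 are arbitrary, chosen only for illustration). When f itself is the final scalar output, autograd returns exactly the values I expect:

```python
import torch

# Arbitrary scalar values, chosen only for illustration.
x = torch.tensor(3.0, requires_grad=True)
w = torch.tensor(2.0, requires_grad=True)
b = torch.tensor(1.0, requires_grad=True)

f = w * x + b
f.backward()  # f is a scalar, so no explicit gradient argument is needed

print(x.grad)  # df/dx, which should equal w
print(w.grad)  # df/dw, which should equal x
print(b.grad)  # df/db, which should equal 1
```

So in the scalar case my formulas hold as written, with no extra factor in sight.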

What piece of the puzzle am I missing?