There is something I don’t quite understand in this tutorial:
There is an example that implements a simple linear function in PyTorch:
from torch.autograd import Function

class LinearFunction(Function):

    # Note that both forward and backward are @staticmethods
    @staticmethod
    # bias is an optional argument
    def forward(ctx, input, weight, bias=None):
        ctx.save_for_backward(input, weight, bias)
        output = input.mm(weight.t())
        if bias is not None:
            output += bias.unsqueeze(0).expand_as(output)
        return output

    # This function has only a single output, so it gets only one gradient
    @staticmethod
    def backward(ctx, grad_output):
        # This is a pattern that is very convenient - at the top of backward
        # unpack saved_tensors and initialize all gradients w.r.t. inputs to
        # None. Thanks to the fact that additional trailing Nones are
        # ignored, the return statement is simple even when the function has
        # optional inputs.
        input, weight, bias = ctx.saved_tensors
        grad_input = grad_weight = grad_bias = None

        # These needs_input_grad checks are optional and there only to
        # improve efficiency. If you want to make your code simpler, you can
        # skip them. Returning gradients for inputs that don't require it is
        # not an error.
        if ctx.needs_input_grad[0]:
            grad_input = grad_output.mm(weight)
        if ctx.needs_input_grad[1]:
            grad_weight = grad_output.t().mm(input)
        if bias is not None and ctx.needs_input_grad[2]:
            grad_bias = grad_output.sum(0)

        return grad_input, grad_weight, grad_bias
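(For completeness, the custom backward can be checked numerically with torch.autograd.gradcheck; the snippet below is my own, not part of the tutorial, and the tensor shapes are just ones I picked:)

import torch
from torch.autograd import gradcheck

# Double precision is recommended so the finite-difference comparison is stable.
input = torch.randn(3, 4, dtype=torch.double, requires_grad=True)
weight = torch.randn(5, 4, dtype=torch.double, requires_grad=True)
bias = torch.randn(5, dtype=torch.double, requires_grad=True)

# gradcheck compares the analytical gradients from backward() against
# numerical estimates and returns True if they agree.
print(gradcheck(LinearFunction.apply, (input, weight, bias), eps=1e-6, atol=1e-4))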
My question is probably simplistic, but if backward is simply the “gradient formula”, why is each gradient multiplied by grad_output?
In my mind I have (writing x = input and b = bias for clarity):

f(x) = w*x + b
df/dx = w
df/dw = x
df/db = 1
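And indeed, when I try a scalar version directly with autograd (my own quick check, not from the tutorial), the gradients come out as exactly w, x and 1, with no extra factor in sight:

import torch

# Scalar sanity check: f(x) = w*x + b
x = torch.tensor(2.0, requires_grad=True)
w = torch.tensor(3.0, requires_grad=True)
b = torch.tensor(1.0, requires_grad=True)

f = w * x + b
f.backward()

print(x.grad)  # tensor(3.) == w
print(w.grad)  # tensor(2.) == x
print(b.grad)  # tensor(1.)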
What piece of the puzzle am I missing?