I need to implement my own function, and I’m struggling to understand the documentation. Coming from a mathematical point of view, I think of the setup as follows: I’m implementing a function f. For autograd to work, it needs to know how to compute the derivative of g∘f when g is a function whose domain matches the codomain of f, assuming it already knows how to compute the derivative of g. By the chain rule, it suffices to provide the derivative of f. How do I fit this into the language of the PyTorch documentation? To me, the derivative of a function from R^m to R^n at some given point is a linear map, not simply a vector. I’d compute the matrix of this linear map and multiply it with the matrix for g.

The documentation for backward says: “It must accept a context ctx as the first argument, followed by as many outputs did forward() return, and it should return as many tensors, as there were inputs to forward(). Each argument is the gradient w.r.t the given output, and each returned value should be the gradient w.r.t. the corresponding input.”

What, precisely, does “each argument is the gradient w.r.t the given output” mean? The gradient of the full composition, possibly with some projection onto some of the first map’s variables, I assume? But then that’s a linear map. In this example the values returned from backward seem to be vectors, not matrices (representing linear maps). Can someone help me clear up my confusion, perhaps by expressing in more mathematical terms what backward is supposed to do?

So, to clarify: suppose I have some function f: R^2 -> R^3 with formula f(x, y) = (x^2, xy, y^2). It’s implemented with a forward that takes a single Tensor of shape [2] as argument and returns a single Tensor of shape [3] as per the formula. By my (apparently wrong) understanding, backward should take a Tensor A of shape [m, 3] (for some m depending on the function following f in the composition) and return a Tensor of shape [m, 2] representing the gradient of the composition at some point (obtained from the context object). The gradient of f can be represented by a 3x2 matrix, the Jacobian at the given point, and the value returned from backward is simply the matrix product of A and this matrix.
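To make the shapes concrete, here is a minimal pure-Python sketch of the Jacobian of this f and of the product A·J described above. The helper names (f, jacobian, matmul) are made up for illustration and are not part of any PyTorch API, and the point (1, 2) and the matrix A are arbitrary choices.

```python
# Illustrative sketch only: f, jacobian and matmul are hypothetical helpers.

def f(x, y):
    # f: R^2 -> R^3, f(x, y) = (x^2, xy, y^2)
    return [x * x, x * y, y * y]

def jacobian(x, y):
    # 3x2 Jacobian of f: one row per output, one column per input
    return [[2 * x, 0.0],
            [y, x],
            [0.0, 2 * y]]

def matmul(A, B):
    # Plain matrix product of two lists of rows
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

J = jacobian(1.0, 2.0)      # Jacobian at the point (1, 2)
A = [[1.0, 1.0, 1.0]]       # shape [m, 3] with m = 1
result = matmul(A, J)       # shape [m, 2]
print(result)               # [[4.0, 5.0]]
```

With A = [1, 1, 1] the product is the gradient of L = x^2 + xy + y^2, namely (2x + y, x + 2y) = (4, 5) at (1, 2).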

What’s the point of taking the A argument at all? Or, in your formulation: Why does backward need to know dL/dy, when as you say the result we want is just the product of dL/dy and dy/dx? Why isn’t it just the responsibility of backward to compute dy/dx (at the given point)?

No, the backward takes in a tensor of shape [3], representing dL/d(output) for some scalar L, and returns a tensor of shape [2], representing [dL/dx, dL/dy].
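A hedged sketch of what this looks like for the f above, written as a custom torch.autograd.Function. The class name SquaresAndProduct and the choice L = sum of the outputs are assumptions made for the example; note that backward never builds the 3x2 Jacobian and only returns the product of grad_output with it.

```python
import torch

class SquaresAndProduct(torch.autograd.Function):
    # Hypothetical name; implements f(x, y) = (x^2, xy, y^2)

    @staticmethod
    def forward(ctx, inp):
        # inp has shape [2]; keep it for use in backward
        ctx.save_for_backward(inp)
        x, y = inp[0], inp[1]
        return torch.stack([x * x, x * y, y * y])

    @staticmethod
    def backward(ctx, grad_output):
        # grad_output has shape [3] and holds dL/d(output) for some scalar L
        (inp,) = ctx.saved_tensors
        x, y = inp[0], inp[1]
        # Vector-Jacobian product grad_output^T J, written out by hand
        dL_dx = grad_output[0] * 2 * x + grad_output[1] * y
        dL_dy = grad_output[1] * x + grad_output[2] * 2 * y
        return torch.stack([dL_dx, dL_dy])  # shape [2]

inp = torch.tensor([1.0, 2.0], requires_grad=True)
out = SquaresAndProduct.apply(inp)
L = out.sum()        # scalar L = x^2 + xy + y^2
L.backward()         # autograd passes dL/d(output) = [1, 1, 1] to backward
print(inp.grad)      # tensor([4., 5.])
```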

Aha! This is highly illuminating! Thank you. One thing I don’t understand though: what is the meaning of the argument given to backward when the output leaf (sorry, I don’t know the proper terminology) of the graph is not a scalar? I agree with your comment that having a scalar-valued loss function is what makes sense for deep learning, but it seems to me that the AD framework allows for an output leaf that is vector-valued.
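As far as I can tell, for a vector-valued output you have to supply that argument yourself: calling backward() without it on a non-scalar tensor raises an error, and passing gradient=v makes autograd compute the vector-Jacobian product v^T J. A small sketch, assuming the same f as above written with ordinary tensor operations; the vector v is an arbitrary choice that picks out the first row of the Jacobian.

```python
import torch
from torch.autograd.functional import jacobian

def f(inp):
    # Same f(x, y) = (x^2, xy, y^2), built from differentiable tensor ops
    x, y = inp[0], inp[1]
    return torch.stack([x * x, x * y, y * y])

inp = torch.tensor([1.0, 2.0], requires_grad=True)
out = f(inp)                       # shape [3], not a scalar

v = torch.tensor([1.0, 0.0, 0.0])  # weights on the three outputs
out.backward(gradient=v)           # computes v^T J
print(inp.grad)                    # tensor([2., 0.])

# The full 3x2 Jacobian, if you really want the linear map itself
J = jacobian(f, torch.tensor([1.0, 2.0]))
print(J.shape)                     # torch.Size([3, 2])
```

Equivalently, this is the gradient of the scalar (v · out), so the vector-valued case reduces to the scalar one.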