# Different wording of the requirements of `backward` for a function

I need to implement my own function, and I’m struggling to understand the documentation. Coming from a mathematical point of view, I think of the setup as follows: I’m implementing a function f. For autograd to work, it needs to know how to compute the derivative of g∘f when g is a function whose domain matches the codomain of f, assuming it already knows how to compute the derivative of g. By the chain rule, it suffices to provide the derivative of f. How do I fit this into the language of the PyTorch documentation? To me, the derivative of a function from R^m to R^n at a given point is a linear map, not simply a vector. I’d compute the matrix of this linear map and multiply it with the one for g.
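
In symbols, the picture I have in mind is the ordinary chain rule with derivatives as Jacobian matrices at a point p:

$$D(g \circ f)(p) \;=\; Dg\bigl(f(p)\bigr)\,Df(p),$$

where, for f: R^m → R^n, Df(p) is the n×m Jacobian of f at p.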

The documentation for `backward` says:

> It must accept a context ctx as the first argument, followed by as many outputs did forward() return, and it should return as many tensors, as there were inputs to forward(). Each argument is the gradient w.r.t the given output, and each returned value should be the gradient w.r.t. the corresponding input.

What, precisely, does “each argument is the gradient w.r.t the given output” mean? The gradient of the full composition, possibly with some projection onto some of the first map’s variables, I assume? But then that’s a linear map. In this example the values returned from `backward` seem to be vectors, not matrices (representing linear maps). Can someone help me clear up my confusion, perhaps by expressing in more mathematical terms what `backward` is supposed to do?

It doesn’t compute the full Jacobian. Instead, for a function `y = f(x)`, given some `dL/dy` tensor, `backward` computes `dL/dx`, which is `dL/dy * dy/dx`.
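
Written out componentwise:

$$\frac{\partial L}{\partial x_j} \;=\; \sum_i \frac{\partial L}{\partial y_i}\,\frac{\partial y_i}{\partial x_j},$$

so `backward` consumes `dL/dy` and returns this vector–Jacobian product directly, without ever materializing the Jacobian `dy/dx` itself.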

So, to clarify: suppose I have some function f:R^2 -> R^3. Suppose its formula is f(x,y) = (x^2, xy, y^2). It’s implemented with a `forward` that takes a single `Tensor` of shape `[2]` as argument and returns a single `Tensor` of shape `[3]` as per the formula. By my (apparently wrong) understanding, `backward` should take a `Tensor` `A` of shape `[m, 3]` (for some `m` depending on the function following f in the composition) and return a `Tensor` of shape `[m, 2]` representing the gradient of the composition at some point (obtained from the context object). The gradient of f can be represented by a 3x2 matrix, the Jacobian at the given point, and the value returned from `backward` is simply the matrix product of `A` and this matrix.
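
Concretely, the Jacobian of this f at a point (x, y) is

$$Df(x, y) = \begin{pmatrix} 2x & 0 \\ y & x \\ 0 & 2y \end{pmatrix},$$

a 3×2 matrix, so the product of `A` with it would indeed have shape `[m, 2]`.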

What’s the point of taking the `A` argument at all? Or, in your formulation: why does `backward` need to know dL/dy, when, as you say, the result we want is just the product of dL/dy and dy/dx? Why isn’t it just the responsibility of `backward` to compute dy/dx (at the given point)?

No, the backward takes in a tensor of shape `[3]`, representing `dL/d(output)` for some scalar `L`, and returns a tensor of shape `[2]`, representing `[dL/dx, dL/dy]`.

Because in most cases, computing `dL/dx` given `dL/dy` is computationally much cheaper than forming the full Jacobian, and it is what people want when optimizing a scalar objective.
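
To make that concrete, here is a minimal sketch of a custom `Function` for the f(x, y) = (x^2, xy, y^2) example above. The class name and the small usage snippet are just illustrative, and the Jacobian is materialized explicitly only for clarity; in practice you would usually write the product out directly.

```python
import torch
from torch.autograd import Function


class SquaresAndProduct(Function):
    """f(x, y) = (x^2, x*y, y^2): takes a tensor of shape [2], returns shape [3]."""

    @staticmethod
    def forward(ctx, inp):
        x, y = inp[0], inp[1]
        ctx.save_for_backward(inp)
        return torch.stack([x * x, x * y, y * y])

    @staticmethod
    def backward(ctx, grad_output):
        # grad_output has the same shape as the output, [3]; it is dL/d(output).
        (inp,) = ctx.saved_tensors
        x, y = inp[0], inp[1]
        # Jacobian of f at (x, y): a 3x2 matrix.
        jac = torch.stack([
            torch.stack([2 * x, torch.zeros_like(x)]),
            torch.stack([y, x]),
            torch.stack([torch.zeros_like(y), 2 * y]),
        ])
        # Vector-Jacobian product: dL/d(input) = dL/d(output) @ J, shape [2].
        return grad_output @ jac
```

Used like this, `backward` only ever sees the gradient of a scalar `L` with respect to the output:

```python
inp = torch.tensor([1.0, 2.0], requires_grad=True)
out = SquaresAndProduct.apply(inp)   # shape [3]
L = out.sum()                        # some scalar loss
L.backward()
print(inp.grad)                      # dL/d(input): tensor([4., 5.])
```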

Aha! This is highly illuminating! Thank you. One thing I don’t understand though: what is the meaning of the argument given to `backward` when the output leaf (sorry, I don’t know the proper terminology) of the graph is not a scalar? I agree with your comment that having a scalar-valued loss function is what makes sense for deep learning, but it seems to be that the AD framework allows for an output leaf that is vector-valued.

Currently that is not supported by PyTorch, but I agree that having a way to compute the full Jacobian would be great!
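
That said, you can already assemble the full Jacobian by hand with one backward pass per output component, by passing standard basis vectors as `grad_outputs`. A rough sketch (the helper name is just illustrative):

```python
import torch


def full_jacobian(f, x):
    """Jacobian of f at x, built row by row with one VJP per output entry."""
    x = x.detach().requires_grad_(True)
    y = f(x)
    rows = []
    for i in range(y.numel()):
        basis = torch.zeros_like(y)
        basis.view(-1)[i] = 1.0
        # grad_outputs=basis picks out row i of the Jacobian via a VJP.
        (row,) = torch.autograd.grad(y, x, grad_outputs=basis, retain_graph=True)
        rows.append(row.view(-1))
    return torch.stack(rows)  # shape [y.numel(), x.numel()]


# For f(x, y) = (x^2, x*y, y^2) at (1, 2) this prints [[2, 0], [2, 1], [0, 4]]:
f = lambda v: torch.stack([v[0] ** 2, v[0] * v[1], v[1] ** 2])
print(full_jacobian(f, torch.tensor([1.0, 2.0])))
```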

It’s also not supported by most other popular autodiff libraries.

OK, I understand. Thanks!