# How does the autograd function work in nonlinear equations?

Hi all

I was trying to understand how the `autograd` module and the `backward` function work. In some projects, I’ve seen non-linear and non-convex operations like the following:

``````
import torch as th

z1, z2 = model(x1, x2)   # model, x1, x2 come from the surrounding project

# standardize each feature over the batch dimension
z1 = (z1 - z1.mean(0)) / z1.std(0)
z2 = (z2 - z2.mean(0)) / z2.std(0)
N = z1.shape[0]

# cross-correlation matrix between the two embeddings
c = (z1.T @ z2) / N

loss = -th.diagonal(c).sum()
loss.backward()
``````

These operations, although not complicated, are highly non-linear (`std` involves squaring the terms, there is a division and the derivative of a quotient is messy, etc.), and it would take a while to work out the gradients by hand. However, PyTorch does it automatically, and I’m trying to understand how this is possible.

I’ve tried to check what the `grad_fn` of the `loss`, `c`, and `z1` variables are, and they involve things like `MmBackward` and `DivBackward`, but I don’t really see how these can be computed using Jacobian and gradient products. I was trying to find the code for these backward functions in PyTorch’s GitHub repository to see if it would help me understand, but I’m not able to find it.
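
For what it’s worth, here is a small sketch of one way to inspect that graph of `grad_fn` nodes (assuming the snippet above has already been run, so `loss` exists; the exact node names vary a bit between versions):

``````
# walk the graph of backward nodes that autograd built for `loss`
def walk(fn, depth=0):
    if fn is None:
        return
    print("  " * depth + type(fn).__name__)
    for next_fn, _ in fn.next_functions:
        walk(next_fn, depth + 1)

walk(loss.grad_fn)
# prints nodes such as NegBackward, SumBackward, DiagonalBackward,
# DivBackward, MmBackward, ... down to AccumulateGrad at the leaves
``````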

I would be really grateful if someone could explain, at a high level, how this works.

Hi Victor!

Autograd works in two pieces: First, “building-block” functions know
how to compute their own gradients (their gradient formulas are, in effect,
implemented by hand). Second, autograd uses the chain rule to compute
(numerically) the gradient of a “composite” function that has been
constructed by stringing together a sequence of these building-block
functions.

So, on the one hand, you can build up complicated nonlinearities by
stringing together simple building-block functions, and autograd will
do the rest by using the chain rule.
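
As a concrete illustration of that (a toy example of my own, not your code), autograd’s chain-rule result for a composition such as `sin (t**3)` matches the derivative you would work out by hand:

``````
import torch

t = torch.arange (5.0, requires_grad = True)

# composite of two building blocks: pow() followed by sin()
torch.sin (t.pow (3)).sum().backward()

# chain rule by hand: d/dt sin (t**3) = cos (t**3) * 3 * t**2
by_hand = torch.cos (t.detach()**3) * 3 * t.detach()**2
print (torch.allclose (t.grad, by_hand))   # True
``````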

On the other hand, a single building-block function can be very complicated
and highly nonlinear, provided that whoever wrote it also implemented its
corresponding gradient function (typically packaged as a `backward()`
method).
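
Here is a minimal sketch of such a hand-written building block (the `Cube` function is just for illustration; it is not part of pytorch):

``````
import torch

class Cube (torch.autograd.Function):
    # forward() computes x**3; backward() supplies the matching,
    # hand-written gradient, 3 * x**2, times the incoming gradient
    @staticmethod
    def forward (ctx, x):
        ctx.save_for_backward (x)
        return x**3

    @staticmethod
    def backward (ctx, grad_output):
        x, = ctx.saved_tensors
        return 3 * x**2 * grad_output

t = torch.arange (5.0, requires_grad = True)
Cube.apply (t).sum().backward()
print (t.grad)   # tensor([ 0.,  3., 12., 27., 48.])
``````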

Consider:

``````
>>> import torch
>>> torch.__version__
'1.10.2'
>>> t = torch.arange (5.0, requires_grad = True)
>>> t
tensor([0., 1., 2., 3., 4.], requires_grad=True)
>>> t.pow (3).sum().backward()
>>> t.grad
tensor([ 0.,  3., 12., 27., 48.])
``````

Ignoring the `sum()` (which is just a technical device to illustrate the
derivative of `pow()` for a number of different arguments), this example
has only a single building-block function, namely `pow()`.

The nonlinear function `torch.pow()` knows how to compute its gradient
(derivative), and when you call `.backward()` on a chain of functions that
includes `torch.pow()`, autograd asks `torch.pow()` to compute its gradient
when it hits that step in applying the chain rule to the overall composite
function.
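
The same is true of the `MmBackward` and `DivBackward` nodes you mention: each one is just a vector-Jacobian product that somebody implemented by hand. Here is an illustrative sketch (with toy shapes of my own choosing) of what the matrix-multiplication step in your example boils down to:

``````
import torch

# toy stand-ins for z1 and z2; c = z1.T @ z2, as in your snippet
z1 = torch.randn (8, 3, requires_grad = True)
z2 = torch.randn (8, 3, requires_grad = True)
c = z1.T @ z2

grad_c = torch.ones_like (c)   # pretend incoming (upstream) gradient
c.backward (grad_c)

# the vector-Jacobian products that matmul's backward performs
print (torch.allclose (z1.grad, z2 @ grad_c.T))   # True
print (torch.allclose (z2.grad, z1 @ grad_c))     # True
``````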

There is no magic – just a lot of work in making autograd’s chain-rule
processing work correctly and in making sure that each and every
building-block function computes its gradient correctly.

Best.

K. Frank