How does the autograd function work in nonlinear equations?

Hi all

I was trying to understand how the autograd module and the backward function work. In some projects, I’ve seen non-linear and non-convex operations like the following:

import torch as th

# the model produces two embeddings for the two inputs
z1, z2 = model(x1, x2)

# standardize each embedding along the batch dimension
z1 = (z1 - z1.mean(0)) / z1.std(0)
z2 = (z2 - z2.mean(0)) / z2.std(0)
N = z1.shape[0]

# cross-correlation matrix of the two standardized embeddings
c = (z1.T @ z2) / N

# loss is the negative sum of the diagonal of c
loss = -th.diagonal(c).sum()
loss.backward()

Code adapted from here.

These operations, although not complex, are highly non-linear (std involves squaring the terms, there is a division whose derivative is not trivial, etc.), and it would take a while to work out the gradient by hand. However, PyTorch does it automatically, and I’m trying to understand how this is possible.

I’ve tried to check what the grad_fn of the loss, c, and z1 variables are, and they involve things like MmBackward and DivBackward, but I don’t really see how these can be computed using Jacobian and gradient products. I was trying to find the code for these backward functions in PyTorch’s GitHub repository to see if it would help me understand, but I’m not able to find it.
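For reference, this is roughly how I’ve been inspecting the graph (just a quick sketch, using the loss from the snippet above and only following the first input at each step):

node = loss.grad_fn
while node is not None:
    print(type(node).__name__)   # e.g. NegBackward0, SumBackward0, DiagonalBackward0, ...
    node = node.next_functions[0][0] if node.next_functions else None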

I would be really grateful if someone could explain, at a high level, how this works.

Thanks in advance

Hi Victor!

Autograd works in two pieces: First, “building-block” functions know
how to compute their own gradients. Second, the autograd machinery
uses the chain rule to compute (numerically) the gradient of a “composite”
function that has been constructed by stringing together a sequence of
these building-block functions.

So, on the one hand, you can build up complicated nonlinearities by
stringing together simple building-block functions, and autograd will
do the rest by using the chain rule.

On the other hand, a single building-block function can be very complicated
and highly nonlinear, provided that whoever wrote it also implemented its
corresponding gradient function (typically packaged as a backward()
method).
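For instance, such a building block could be written as a custom
torch.autograd.Function whose backward() implements the hand-written
derivative. A quick sketch (not something your snippet needs, just an
illustration):

import torch

class CubePlusSin (torch.autograd.Function):
    """Building block f(x) = x**3 + sin(x) that knows its own gradient."""
    @staticmethod
    def forward (ctx, x):
        ctx.save_for_backward (x)
        return x**3 + torch.sin (x)

    @staticmethod
    def backward (ctx, grad_output):
        # autograd hands us the "upstream" gradient; we multiply it by the
        # local derivative 3*x**2 + cos(x) -- this is the chain rule step.
        x, = ctx.saved_tensors
        return grad_output * (3 * x**2 + torch.cos (x))

x = torch.randn (4, requires_grad = True)
CubePlusSin.apply (x).sum().backward()   # x.grad is now 3*x**2 + cos(x)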

Consider:

>>> import torch
>>> torch.__version__
'1.10.2'
>>> t = torch.arange (5).float()
>>> t.requires_grad = True
>>> t
tensor([0., 1., 2., 3., 4.], requires_grad=True)
>>> t.pow (3).sum().backward()
>>> t.grad
tensor([ 0.,  3., 12., 27., 48.])

Ignoring the sum() (which is just a technical device to illustrate the
derivative of pow() for a number of different arguments), this example
has only a single building-block function, namely pow().

The nonlinear function torch.pow() knows how to compute its gradient
(derivative), and when you call .backward() on a chain of functions that
includes torch.pow(), autograd asks torch.pow() to compute its gradient
when it hits that step in applying the chain rule to the overall composite
function.
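If you string more than one building-block function together, say pow()
followed by exp(), autograd multiplies the local derivatives together for
you. A quick sketch that checks this against the chain rule done by hand:

import torch

t = torch.arange (5).float().requires_grad_()
# composite function: f(t) = sum (exp (t**3))
t.pow (3).exp().sum().backward()

# chain rule by hand: d/dt exp (t**3) = exp (t**3) * 3 * t**2
manual = (t.detach()**3).exp() * 3 * t.detach()**2
print (torch.allclose (t.grad, manual))   # True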

There is no magic – just a lot of work in making autograd’s chain-rule
processing work correctly and in making sure that each and every
building-block function computes its gradient correctly.
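(As a side note, pytorch provides torch.autograd.gradcheck, which compares
an implemented gradient against finite-difference derivatives; that is one
way such correctness gets checked. A quick sketch:

import torch

x = torch.randn (6, dtype = torch.double, requires_grad = True)
print (torch.autograd.gradcheck (lambda t: t.pow (3), (x,)))   # True
)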

Best.

K. Frank