What is param.grad the gradient of?

snowball · November 12, 2021, 7:56pm

I’m slightly confused about what param.grad calculates mathematically.

Here, it says “Computes and returns the sum of gradients of outputs with respect to the inputs.”
https://pytorch.org/docs/stable/autograd.html

Can someone help me interpret this?
Suppose I’m looking at the weight of one neuron in a fully connected hidden layer denoted as w. Suppose my batch size is 64.
Is param.grad for w the partial derivative dL/dw calculated for each of the 64 raw inputs, and then summed?

KFrank · November 13, 2021, 2:05pm

Hi Snowball!

Can I help you interpret this? Yes.

Is param.grad the per-input dL/dw, summed over the 64 inputs? Not exactly.

Let w be a Parameter (or, for that matter, just a Tensor that
has requires_grad = True, but is not wrapped in a Parameter),
and let L be a scalar (that is, a tensor with a single element) that
has been calculated from w (and a bunch of other Parameters
and non-Parameter numbers).

If w.grad starts out as 0.0 (or None), then calling
L.backward() populates w.grad with dL / dw. (If w.grad already
holds a nonzero value, backward() adds dL / dw to it.) Note that
autograd is calculating the gradient of a single scalar, L, not of a
batch of scalars that then get further processed.
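
As a minimal sketch of the behavior described above (the tensor values and
the names grad_before, grad_first, and grad_second are made up for
illustration), the following shows w.grad starting out as None, being
populated by the first backward() call, and being added to by a second one:

```python
import torch

# A leaf tensor that requires gradients; the values are arbitrary.
w = torch.tensor([1.0, 2.0], requires_grad=True)
grad_before = w.grad              # None: nothing has been computed yet

L = (w ** 2).sum()                # a single scalar computed from w
L.backward()                      # populates w.grad with dL/dw = 2 * w
grad_first = w.grad.clone()

L2 = (w ** 2).sum()               # same scalar, rebuilt as a fresh graph
L2.backward()                     # adds dL2/dw to the existing w.grad
grad_second = w.grad.clone()

print(grad_before, grad_first, grad_second)

w.grad = None                     # reset before the next, unrelated backward()
```

This accumulation is why training loops usually reset gradients (for
example with optimizer.zero_grad()) before each backward pass.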

In the typical use case L is the value of a loss function obtained by
passing a batch of inputs through your model to produce a batch
of predictions that are then compared with a batch of ground-truth
targets using a loss function. “Standard” pytorch loss functions will,
by default, calculate the mean of the per-sample losses over the
batch to obtain the single scalar value L.

So, to be clear, autograd computes the gradient of a single scalar
L that has already been averaged over the whole batch, rather than
computing the gradient of a whole batch of Ls and then averaging
(or summing) those gradients together.
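
Because differentiation is linear, the gradient of the batch-mean loss has
the same value as the mean of the per-sample gradients, even though autograd
only ever differentiates the one scalar. Here is a rough sketch that checks
this numerically. The linear model, the random data, and the names
batch_grad and mean_of_grads are assumptions for illustration, not anything
from the original question:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Hypothetical setup: 64 samples, 5 features, one weight vector w.
batch_size, n_features = 64, 5
x = torch.randn(batch_size, n_features)
y = torch.randn(batch_size)
w = torch.randn(n_features, requires_grad=True)

# What autograd actually does: one backward pass on one scalar,
# the loss already averaged over the batch (reduction='mean' is the default).
L = F.mse_loss(x @ w, y)
L.backward()
batch_grad = w.grad.clone()

# For comparison only: one backward pass per sample, averaged by hand.
per_sample_grads = []
for i in range(batch_size):
    w.grad = None                          # avoid accumulating across samples
    L_i = (x[i] @ w - y[i]) ** 2           # loss for sample i alone
    L_i.backward()
    per_sample_grads.append(w.grad.clone())
mean_of_grads = torch.stack(per_sample_grads).mean(dim=0)

# The two agree up to floating-point round-off.
print(torch.allclose(batch_grad, mean_of_grads, atol=1e-4))
```

With reduction='sum' instead of the default, the same comparison would hold
against the sum of the per-sample gradients rather than their mean.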

Consider the case where w is a single scalar, and
L = w**2 + 7*w + 53. No batch, no sum over batch samples;
autograd simply computes the derivative of the given quadratic
polynomial, as computed to produce the value L, with respect to w.
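
The quadratic example can be run directly. In this small sketch the
starting value w = 3.0 is an arbitrary choice; analytically,
dL/dw = 2*w + 7, which is 13 at that point:

```python
import torch

# w is a single scalar leaf tensor; 3.0 is an arbitrary illustrative value.
w = torch.tensor(3.0, requires_grad=True)

L = w**2 + 7*w + 53    # a scalar computed from w
L.backward()           # populates w.grad with dL/dw

print(w.grad)          # dL/dw = 2*w + 7 evaluated at w = 3
```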

Best.

K. Frank