I’m slightly confused as to what param.grad is calculating mathematically?
Here, it says “Computes and returns the sum of gradients of outputs with respect to the inputs.”
Can someone help me interpret this?
Suppose I’m looking at the weight of one neuron in a fully connected hidden layer denoted as w. Suppose my batch size is 64.
Is param.grad for w the partial derivative dL/dw calculated separately for each of the 64 raw inputs, and then summed?
Let w be a Parameter (or, for that matter, just a Tensor that has
requires_grad = True, but is not wrapped in a Parameter). Let L be a
scalar (that is, a tensor with a single element) that has been
calculated from w (and a bunch of other things). If param.grad (here,
w.grad) starts out as None, then calling L.backward() populates it
with dL / dw. Note that autograd
is calculating the gradient of a single scalar,
L, not of a batch of
scalars that then get further processed.
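As a minimal sketch of the point above (the tensor values and the name x are made up for illustration), here is a plain Tensor with requires_grad = True whose .grad starts out as None and is filled in by calling backward() on a single scalar L:

```python
import torch

# Hypothetical values: w is a plain Tensor (not a Parameter) that tracks gradients.
w = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
x = torch.tensor([4.0, 5.0, 6.0])  # the "other things" L depends on

# L is a single scalar computed from w.
L = (w * x).sum()

grad_before = w.grad  # None: nothing has been backpropagated yet
L.backward()          # autograd computes dL/dw and stores it in w.grad

print(grad_before)    # None
print(w.grad)         # equals x, since dL/dw_i = x_i
```

Because L = sum_i w_i * x_i, the gradient with respect to each w_i is just x_i, which is easy to check by eye.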
In the typical use case,
L is the value of a loss function obtained by
passing a batch of inputs through your model to produce a batch
of predictions that are then compared with a batch of ground-truth
targets using a loss function. "Standard" PyTorch loss functions will,
by default, calculate the mean of the per-sample losses
over the batch to obtain the single scalar value L.
So, to be clear, autograd computes the gradient of a single scalar
L that has already been averaged over the whole batch, rather than
computing the gradients of a whole batch of
per-sample Ls and then averaging
(or summing) those gradients together.
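To connect this back to the original question, here is a rough sketch with a toy model (a single scalar weight and a squared-error loss, both chosen only for illustration). Autograd does one backward pass on the batch-averaged scalar L. Since differentiation is linear, the number it produces matches what you would get by differentiating each of the 64 per-sample losses separately and averaging, even though that is not how the computation is carried out:

```python
import torch

torch.manual_seed(0)

# Hypothetical toy setup: one scalar weight, a batch of 64 inputs and targets.
w = torch.tensor(0.5, requires_grad=True)
x = torch.randn(64)
y = torch.randn(64)

# What autograd actually does: one backward pass on the averaged scalar L.
per_sample_losses = (w * x - y) ** 2  # shape (64,)
L = per_sample_losses.mean()          # single scalar, like a mean-reduced loss
L.backward()
grad_of_mean = w.grad.clone()

# For comparison only: differentiate each sample's loss on its own, then average.
per_sample_grads = []
for i in range(64):
    l_i = (w * x[i] - y[i]) ** 2
    (g_i,) = torch.autograd.grad(l_i, w)  # does not touch w.grad
    per_sample_grads.append(g_i)
mean_of_grads = torch.stack(per_sample_grads).mean()

print(grad_of_mean.item(), mean_of_grads.item())  # agree up to rounding
```

If the loss were summed over the batch instead of averaged, w.grad would correspondingly be the sum of the per-sample gradients.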
Consider the case where
w is a single scalar, and
L = w**2 + 7*w + 53. No batch, no sum over batch samples;
autograd simply computes the derivative of the given quadratic
polynomial, as computed to produce the value
L, with respect to w.
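The quadratic example can be checked directly. This sketch assumes an arbitrary starting value w = 3; by hand, dL/dw = 2*w + 7, which is 13 at that point:

```python
import torch

# w is a single scalar; the value 3.0 is an arbitrary choice for illustration.
w = torch.tensor(3.0, requires_grad=True)

# No batch and no reduction: L is just a quadratic polynomial in w.
L = w**2 + 7*w + 53

L.backward()  # stores dL/dw in w.grad

print(L.item())       # 83.0
print(w.grad.item())  # 13.0, i.e. 2*w + 7 evaluated at w = 3
```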