I’m slightly confused as to what param.grad is calculating mathematically?

Here, it says “Computes and returns the sum of gradients of outputs with respect to the inputs.”

Can someone help me interpret this?
Suppose I’m looking at the weight of one neuron in a fully connected hidden layer denoted as w. Suppose my batch size is 64.
Is param.grad for the w the partial derivative dL/dw calculated for each of the 64 raw inputs, and then summed?

Hi Snowball!

In spirit, yes, but not exactly — by default the per-sample gradients end
up averaged over the batch rather than summed, and that averaging happens
in the forward pass, before autograd ever runs.
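Here is a minimal sketch of the summed-vs-averaged point (the tensor values and the toy per-sample "loss" `w * x_i` are just for illustration):

```python
import torch

w = torch.tensor(2.0, requires_grad=True)
inputs = torch.tensor([1.0, 2.0, 3.0])

# per-sample "losses" w * x_i, reduced with a mean, as pytorch
# loss functions do by default
L = (w * inputs).mean()
L.backward()

# dL / dw is the *mean* of the per-sample gradients x_i, not their sum
print(w.grad)   # tensor(2.) == (1 + 2 + 3) / 3
```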

Let `w` be a `Parameter` (or, for that matter, just a `Tensor` that
has `requires_grad = True`, but is not wrapped in a `Parameter`),
and let `L` be a scalar (that is, a tensor with a single element) that
has been calculated from `w` (and a bunch of other `Parameter`s
and non-`Parameter` numbers).

If `w.grad` starts out as `None` (or `0.0`), then calling
`L.backward()` populates `w.grad` with `dL / dw`. Note that autograd
is calculating the gradient of a single scalar, `L`, not of a batch of
scalars that then get further processed.
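A minimal sketch of that behavior (the specific values are just for illustration):

```python
import torch

# a scalar Parameter; w.grad starts out as None
w = torch.nn.Parameter(torch.tensor(2.0))

# any scalar L computed from w; here dL / dw = 3
L = 3.0 * w
L.backward()   # populates w.grad with dL / dw

print(w.grad)  # tensor(3.)
```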

In the typical use case, `L` is the value of a loss function obtained by
passing a batch of inputs through your model to produce a batch
of predictions that are then compared with a batch of ground-truth
targets. “Standard” pytorch loss functions will,
by default (`reduction = 'mean'`), calculate the mean of the per-sample
losses over the batch to obtain the single scalar value `L`.
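For example, with `MSELoss` and its default `reduction = 'mean'` (values chosen just for illustration):

```python
import torch

preds = torch.tensor([1.0, 2.0, 4.0])
targets = torch.tensor([1.0, 1.0, 1.0])

loss_fn = torch.nn.MSELoss()   # default reduction = 'mean'
L = loss_fn(preds, targets)    # a single scalar

# mean of the per-sample squared errors: (0 + 1 + 9) / 3
print(L)   # tensor(3.3333)
```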

So, to be clear, autograd computes the gradient of a single scalar
`L` that has already been averaged over the whole batch, rather than
computing the gradient of a whole batch of `L`s and then averaging
those gradients.

Consider the case where `w` is a single scalar, and
`L = w**2 + 7*w + 53`. There is no batch and no sum over batch samples;
autograd simply computes the gradient of this polynomial, as computed to
produce the value `L`, with respect to `w`, namely `2*w + 7`.
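In code (a minimal sketch, with `w = 3.0` chosen arbitrarily):

```python
import torch

w = torch.tensor(3.0, requires_grad=True)
L = w**2 + 7 * w + 53   # a single scalar, no batch involved
L.backward()

# dL / dw = 2*w + 7, evaluated at w = 3
print(w.grad)   # tensor(13.)
```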