Hi everyone, and sorry for reviving a topic that is over 2 years old.
I was wondering if I could ask a quick question about what, precisely, the per-sample gradient is here. What I mean is: you recover the gradient for every sample in the batch by capturing the grad_output and activations of a given layer via hooks, and combining the two with torch.einsum, roughly as in the sketch below.
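For reference, here’s a minimal reconstruction of what I understand the hook approach to be doing, for a single nn.Linear layer (the names are my own, not autograd-hacks’):

```python
import torch
import torch.nn as nn

layer = nn.Linear(4, 3)
saved = {}

# Forward hook captures the layer's input activations, shape [B, in_features].
layer.register_forward_hook(
    lambda mod, inp, out: saved.update(activations=inp[0].detach()))
# Backward hook captures the gradient w.r.t. the layer's output, shape [B, out_features].
layer.register_full_backward_hook(
    lambda mod, g_in, g_out: saved.update(grad_output=g_out[0].detach()))

x = torch.randn(8, 4)        # batch of B=8 samples
layer(x).sum().backward()    # any scalar loss; the hooks fire during backward

# Outer product per sample: one [out_features, in_features] gradient each.
per_sample_grads = torch.einsum(
    "bo,bi->boi", saved["grad_output"], saved["activations"])
print(per_sample_grads.shape)  # torch.Size([8, 3, 4])
```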
But I’m not 100% sure what exactly that per-sample gradient is. To illustrate my point, let’s assume I have a batch of inputs of shape [B, A], where B is the batch size and A is the number of inputs (for example, with a simple feed-forward network). I pass these inputs through a loss function (containing a network of parameters) which returns a Tensor of shape [B,], where each element is effectively the individual loss of one sample in the batch. Now suppose I define the (total) loss as the mean of these individual loss values (so, reduce the Tensor of shape [B,] to a scalar) and backprop that loss via loss.backward(), while using the hooks (as mentioned above) to get the gradients of the parameters for all samples (a concrete sketch of this setup follows the two cases below).
Would these gradients be:

- the gradient of the (total) loss w.r.t. the parameters for the i-th sample in the batch, for all samples, or
- the gradient of the individual loss of the i-th sample w.r.t. the parameters, for all samples?
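Concretely, the setup I have in mind looks something like this (the network and per-sample loss here are just stand-ins for mine):

```python
import torch
import torch.nn as nn

B, A = 8, 4                   # batch size, number of inputs
net = nn.Linear(A, 1)         # stand-in for my actual network
x = torch.randn(B, A)         # inputs of shape [B, A]

individual_losses = net(x).squeeze(1) ** 2  # shape [B,]: one loss per sample
total_loss = individual_losses.mean()       # reduce [B,] to a scalar
total_loss.backward()         # hooks (as above) fire during this call
```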
The reason I’m asking for this clarification is that I’m trying to get the 2nd case, where I calculate the gradient of each individual loss value w.r.t. the parameters of an nn.Module, for all samples.
An example use case would be the KFAC optimizer, where you need to compute the exact Fisher over all samples when rescaling your preconditioned gradients.
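Roughly, the quantity I’d want to build from the stacked per-sample gradients is an (empirical) Fisher-style matrix; this is my own simplification, not KFAC’s actual code:

```python
import torch

B, P = 8, 12                           # batch size, total parameter count
per_sample_grads = torch.randn(B, P)   # stand-in for the stacked [B, P] gradients

# Empirical Fisher: average of the per-sample outer products g_i g_i^T.
fisher = torch.einsum("bi,bj->ij", per_sample_grads, per_sample_grads) / B
print(fisher.shape)  # torch.Size([12, 12])
```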
I’ve managed to put together example code comparing the 2 methods (a minimal version is included after this list):

- using hooks to get the per-sample gradients of the (total) loss, in a similar way to autograd-hacks
- iterating over the batch and computing the gradient of each individual loss w.r.t. the parameters sequentially, storing the results in a list that is then stacked to match the shape of the Tensor returned by the hooks method (which is incredibly slow)
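Here’s a minimal, self-contained version of that comparison for a single nn.Linear layer (the quadratic loss is just a stand-in for my actual one):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
B, A, O = 8, 4, 3
model = nn.Linear(A, O)
x = torch.randn(B, A)
y = torch.randn(B, O)

def individual_losses(inputs, targets):
    # Shape [B,]: one loss per sample, no reduction.
    return ((model(inputs) - targets) ** 2).sum(dim=1)

# --- Method 1: hooks (autograd-hacks style) --------------------------------
saved = {}
fwd = model.register_forward_hook(
    lambda mod, inp, out: saved.update(act=inp[0].detach()))
bwd = model.register_full_backward_hook(
    lambda mod, g_in, g_out: saved.update(gout=g_out[0].detach()))

model.zero_grad()
individual_losses(x, y).mean().backward()  # mean-reduced total loss
fwd.remove(); bwd.remove()

grads_hooks = torch.einsum("bo,bi->boi", saved["gout"], saved["act"])

# --- Method 2: loop over the batch (slow) ----------------------------------
grads_loop = []
for i in range(B):
    loss_i = individual_losses(x[i : i + 1], y[i : i + 1]).squeeze()
    (g_i,) = torch.autograd.grad(loss_i, model.weight)
    grads_loop.append(g_i)
grads_loop = torch.stack(grads_loop)  # [B, O, A], same shape as grads_hooks

# If the hook gradients are per-sample pieces of the *mean* loss, they should
# carry the 1/B factor from the reduction:
print(torch.allclose(grads_hooks, grads_loop / B, atol=1e-6))
```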
These two methods return different values, which suggests one of two things: either I’ve made a mistake somewhere in my code, or the gradients returned via the hooks are the gradients of the total loss w.r.t. the parameters for each sample, rather than the gradients of the individual losses w.r.t. the parameters. I should note that all my samples are independent of each other when passed through the loss function!
Apologies for the long reply to a 2-year-old post, but as this isn’t really mentioned in the docs, I’d just like to clarify these points!
Any help would be greatly appreciated!
Thank you for your time,
Kind regards!
TL;DR - Are the per-sample gradients calculated via this autograd-hacks method the gradients of the total loss with respect to the parameters for each sample, or the gradients of each individual loss with respect to the parameters? (Where “individual loss” means the loss of a single sample before reduction to a scalar loss, so a batch of B samples has B individual losses.)
Thanks once again!