Per-sample gradient, should we design each layer differently?

Hi @Yaroslav_Bulatov!

I’ve managed to put together a relatively simple example that computes the per-sample gradient of my network’s output w.r.t. the network’s parameters for all samples at once, and the results match the sequential per-sample method. So that part is working.

I’m now trying to derive it for another loss function that contains the Laplacian. I compute the Laplacian via the trick shown here, which supports batches, whereas going through torch.autograd.functional.hessian only supports one example per call.
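For context, here’s a rough, self-contained version of that trick as I understand it (a toy model stands in for my actual network, and the function name is just mine):

```python
import torch

def batched_laplacian(model, x):
    """Laplacian of a scalar-output model w.r.t. its input, for a whole batch."""
    x = x.requires_grad_(True)
    u = model(x)                                  # (batch, 1): one scalar per sample
    # Call 1: gradient of u w.r.t. x, keeping the graph for second derivatives.
    grad = torch.autograd.grad(u.sum(), x, create_graph=True)[0]  # (batch, n)
    lap = torch.zeros(x.shape[0])
    # n further calls, one per input dimension -> n+1 calls total for an n x n Hessian.
    for i in range(x.shape[1]):
        # Summing over the batch is safe because cross-sample derivatives are zero.
        g2 = torch.autograd.grad(grad[:, i].sum(), x, create_graph=True)[0]
        lap = lap + g2[:, i]                      # accumulate the diagonal entries
    return lap
```

For example, with u(x) = sum_i x_i^2 the Laplacian is 2n for every sample, which the batched version reproduces.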

The problem: I store the grad_output in a dictionary under the key ['e'], so every time I call torch.autograd.grad (which happens N+1 times for an NxN Hessian), each layer’s grad_output gets overwritten and I’m left with only the final component of the Laplacian rather than the full Laplacian. Is this hook-based per-sample gradient method compatible with the code in the link when the per-sample gradients also depend on other derivative information — here, the Laplacian of the network’s output w.r.t. its input? (I hope that makes sense!)
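One workaround I’ve been considering is appending each call’s grad_output to a list rather than overwriting a single key, so every backward pass’s contribution survives. A rough sketch (the names `grad_store` / `save_grad_output` are just mine):

```python
import torch
import torch.nn as nn

# layer name -> list of grad_outputs, one entry per backward call (instead of
# a single key that each torch.autograd.grad call would overwrite).
grad_store = {}

def save_grad_output(name):
    def hook(module, grad_input, grad_output):
        grad_store.setdefault(name, []).append(grad_output[0].detach())
    return hook

layer = nn.Linear(3, 2)
layer.register_full_backward_hook(save_grad_output("fc"))

x = torch.randn(4, 3, requires_grad=True)
out = layer(x)

# Two separate autograd.grad calls over the same graph: with a single-key
# store the second call clobbers the first; with a list, both are kept.
torch.autograd.grad(out[:, 0].sum(), x, retain_graph=True)
torch.autograd.grad(out[:, 1].sum(), x)
```

After this, `grad_store["fc"]` holds two (batch, 2) tensors, one per call; the per-sample quantities could then be assembled from all of them rather than just the last.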

Thank you for your help! :slight_smile:

P.S. Is it possible to share the Colab example publicly (i.e. without requesting access)?