Per-sample gradient, should we design each layer differently?

Hi @Yaroslav_Bulatov,

I’ve just put together a relatively short example that illustrates my point: this method fails if your loss function depends on the derivative of other terms. The example script is on GitHub, and I think it explains my current issue with using hooks for per-sample gradients.
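
To make the failure concrete inline, here’s a stripped-down sketch of the kind of setup I mean (just a single `nn.Linear` with a toy derivative-based loss, not the linked script itself):

```python
import torch
import torch.nn as nn

# Minimal sketch (not the linked script): capture activations and grad_outputs
# with hooks on a single Linear layer, then use a loss that itself contains a
# derivative of the network output w.r.t. its input (a stand-in for a
# Laplacian-style term).

torch.manual_seed(0)

model = nn.Linear(4, 1)
captured = {}

def forward_hook(module, inputs, output):
    # store the layer input (activation) for the per-sample outer product
    captured["input"] = inputs[0].detach()

def backward_hook(module, grad_input, grad_output):
    # store the gradient w.r.t. the layer output
    captured["grad_output"] = grad_output[0].detach()

model.register_forward_hook(forward_hook)
model.register_full_backward_hook(backward_hook)

x = torch.randn(8, 4, requires_grad=True)
y = model(x)

# Loss that depends on a derivative of the output w.r.t. the input.
dydx, = torch.autograd.grad(y.sum(), x, create_graph=True)
loss = (dydx ** 2).sum()
loss.backward()

# Hook-based per-sample gradient reconstruction: per-sample outer product of
# the stored grad_output and the stored input activation.
per_sample_weight_grad = torch.einsum(
    "bi,bj->bij", captured["grad_output"], captured["input"]
)

# My suspicion: the backward hook only fires for the inner autograd.grad call
# (where grad_output is all ones), not for the double-backward pass that
# actually produces model.weight.grad, so the reconstruction disagrees.
print(torch.allclose(per_sample_weight_grad.sum(0), model.weight.grad))  # prints False
```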

Could this be a bug in using hooks for such derivatives, or is it a limitation of the batch-supported Laplacian trick I referenced in my previous response?

Thank you for all your help! It’s greatly appreciated! :slight_smile:

For completeness: it seems you can’t use this method if your loss function contains multiple derivatives (unless I’m mistaken?).

A useful example is here → https://github.com/AlphaBetaGamma96/per-sample-gradient-limitation/blob/main/example.py. This might just be a limitation of how I store the grad_output variables. If anyone does figure out a way around this, please let me know!
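
For comparison, a hook-free baseline can be computed with a plain per-sample loop. This is again just a rough sketch (slow, and not what the linked script does), but it gives the ground-truth per-sample gradients that the hook reconstruction should match:

```python
import torch
import torch.nn as nn

# Hook-free baseline: per-sample gradients of the same derivative-based loss,
# computed the slow way by looping over the batch one sample at a time.

torch.manual_seed(0)
model = nn.Linear(4, 1)
x = torch.randn(8, 4)

per_sample_grads = []
for b in range(x.shape[0]):
    model.zero_grad()
    xb = x[b:b + 1].clone().requires_grad_(True)  # single-sample leaf input
    yb = model(xb)
    # same derivative-based loss, restricted to one sample
    dydx, = torch.autograd.grad(yb.sum(), xb, create_graph=True)
    (dydx ** 2).sum().backward()
    per_sample_grads.append(model.weight.grad.detach().clone())

per_sample_grads = torch.stack(per_sample_grads)  # shape (batch, 1, 4)
print(per_sample_grads.shape)
```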