How to efficiently compute gradient for each training sample?

Hi folks,

There is a problem that has bothered me for quite a long time. Assume we are minimizing a loss function ℓ(θ; x_i), parameterized by θ, on samples {x_1, …, x_M} using SGD, where M is the mini-batch size. Since PyTorch autograd gradients can only be implicitly created for scalar outputs, I am wondering if there is an efficient way to compute the gradient for each sample, i.e., ∇_θ ℓ(θ; x_i), without setting the batch size to 1 and computing in a for loop (which is too slow)?

Thank you for your help!

loss.backward(gradient=torch.ones_like(loss))
The shape of loss is (B,), where B is the batch size.
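For anyone else reading, here is a minimal, self-contained sketch of this (the toy model and tensor names are my own):

```python
import torch
import torch.nn.functional as F

# Minimal sketch (toy model/tensor names are my own): backward() on a
# non-scalar loss needs an explicit `gradient` argument; all-ones makes it
# equivalent to loss.sum().backward().
torch.manual_seed(0)
model = torch.nn.Linear(4, 3)
x = torch.randn(8, 4)                      # mini-batch of B = 8
y = torch.randint(0, 3, (8,))

loss = F.cross_entropy(model(x), y, reduction='none')  # shape (8,)
loss.backward(gradient=torch.ones_like(loss))

# Caveat: .grad still holds the gradient summed over the batch,
# not one gradient per sample.
print(model.weight.grad.shape)             # torch.Size([3, 4])
```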


Hi Zhaomang,

Thanks so much for your reply! If the loss is a non-scalar, i.e., the per-sample loss:

loss = F.cross_entropy(predictions, results, reduction='none')
loss.backward(gradient=torch.ones_like(loss))

I’m still confused about how to get the gradient on each sample, if I run the following code:

for params in model.parameters(): 
    print(params.grad)

I would expect to get B times as many gradient tensors as in the scalar-loss case, since there should be an individual result for each sample.

In addition, it would be very helpful if you could point me to the documentation of the Tensor.backward() function (if any), so that I can learn more about this. Thanks again!
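For reference, the slow per-sample loop mentioned in the original question can at least serve as a correctness baseline (toy setup of my own):

```python
import torch
import torch.nn.functional as F

# The slow-but-correct baseline (toy model/tensor names are my own): one
# backward pass per sample via torch.autograd.grad, giving a list of
# per-sample parameter gradients to check any faster trick against.
torch.manual_seed(0)
model = torch.nn.Linear(4, 3)
x = torch.randn(8, 4)                      # B = 8
y = torch.randint(0, 3, (8,))

per_sample_grads = []
for i in range(x.shape[0]):
    loss_i = F.cross_entropy(model(x[i:i + 1]), y[i:i + 1])
    grads = torch.autograd.grad(loss_i, model.parameters())
    per_sample_grads.append([g.detach().clone() for g in grads])

print(len(per_sample_grads))               # 8: one gradient set per sample
```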

If you just have conv/linear layers, you could use this – https://github.com/cybertronai/autograd-hacks#per-example-gradients


params.grad is already the gradient accumulated over the B samples.
If you need the gradient with respect to each sample, you can call
X.retain_grad() before backward(), and then read X.grad to get the gradient of the loss on each sample.
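A quick sketch of this suggestion (toy setup of my own). One caveat: X.grad is the gradient of the loss with respect to each input sample, not a per-sample gradient of the model parameters:

```python
import torch
import torch.nn.functional as F

# Hedged sketch of the retain_grad() suggestion (toy setup of my own).
# Caveat: X.grad is the gradient of the loss w.r.t. each *input* sample,
# not a per-sample gradient of the model parameters.
torch.manual_seed(0)
model = torch.nn.Linear(4, 3)
X = torch.randn(8, 4).requires_grad_()
y = torch.randint(0, 3, (8,))

X.retain_grad()  # a no-op for a leaf tensor; needed if X were an intermediate
loss = F.cross_entropy(model(X), y, reduction='none')
loss.backward(gradient=torch.ones_like(loss))

print(X.grad.shape)  # torch.Size([8, 4]): one gradient row per sample
```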

Are you going to extend this to other popular layer types, e.g., ReLU and batch norm? This would make it useful for the most widely used architectures, such as ResNets.

It should work for those architectures as well. What’s missing is support for other layers with trainable parameters (e.g., the MultiheadAttention layer).

Not with batch norm. If I understood your implementation correctly, layers with batch normalization are just skipped:

_supported_layers = ['Linear', 'Conv2d']

if layer_type not in _supported_layers:
    continue

This means that for a batch normalization layer (which is trainable), grad1 will not be created.

Ah right, the gamma and beta parameters. It seems feasible to extend the computation explained here to a per-example computation: https://kratzert.github.io/2016/02/12/understanding-the-gradient-flow-through-the-batch-normalization-layer.html
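A hedged sketch of what that per-example extension might look like for the gamma/beta of a BatchNorm1d layer. This is my own derivation, not part of autograd-hacks, and it assumes no cross-sample mixing after the layer (so ∂ℓ_i/∂y_j = 0 for j ≠ i):

```python
import torch
import torch.nn as nn

# Hedged sketch (not from autograd-hacks): per-example gradients for gamma/beta
# of a BatchNorm1d layer.  Under the assumption that loss_i depends only on the
# layer's output y[i], we have dloss_i/dgamma = grad_out[i] * x_hat[i] and
# dloss_i/dbeta = grad_out[i].
torch.manual_seed(0)
bn = nn.BatchNorm1d(3)
x = torch.randn(8, 3, requires_grad=True)  # requires_grad so the hook fires

captured = {}

def fwd_hook(mod, inp, out):
    # Recompute x_hat = (x - batch_mean) / sqrt(batch_var + eps), matching
    # BatchNorm's training-mode normalization (biased variance).
    xi = inp[0]
    mean = xi.mean(0)
    var = xi.var(0, unbiased=False)
    captured['x_hat'] = ((xi - mean) / torch.sqrt(var + mod.eps)).detach()

def bwd_hook(mod, grad_input, grad_output):
    captured['grad_out'] = grad_output[0].detach()

bn.register_forward_hook(fwd_hook)
bn.register_full_backward_hook(bwd_hook)

y = bn(x)
loss = y.pow(2).sum(dim=1)                 # toy per-sample losses, shape (8,)
loss.backward(gradient=torch.ones_like(loss))

grad1_gamma = captured['grad_out'] * captured['x_hat']   # shape (8, 3)
grad1_beta  = captured['grad_out']                       # shape (8, 3)

# Sanity check: summing per-example gradients over the batch recovers the
# accumulated .grad computed by autograd.
print(torch.allclose(grad1_gamma.sum(0), bn.weight.grad, atol=1e-5))  # True
```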

Can the current version of autograd-hacks still be used if I apply other activation functions such as Sigmoid / tanh / ReLU? Thanks.