Hi, I came across a simple use case of the PyTorch autograd function. I have a batch of inputs to a network, X, of shape [N, H], and the loss L has shape [N,]. Now I want to get the gradient of each loss sample independently w.r.t. the network layer weight w. My current workaround is to loop over the N samples and call torch.autograd.grad once per sample:

def compute_grad(loss_batch, weight):
    """
    loss_batch: batched loss for multiple samples, shape [N,]
    weight: the network layer weight to compute the gradients w.r.t.
    """
    grads = []
    for i in range(len(loss_batch)):
        # retain_graph=True keeps the graph alive for the next iteration
        grad_persample = torch.autograd.grad(loss_batch[i], weight, retain_graph=True)[0]
        grads.append(grad_persample)
    return torch.stack(grads)

Is there a batched way to implement this simple use case?

If you want to batch (or vectorize) over a function, you’ll want to have a look at the torch.func namespace. The documentation is here: torch.func — PyTorch 2.4 documentation

Hi @AlphaBetaGamma96, thanks for your suggestion. Can torch.func parallelize the gradient computation? If so, could you please suggest a simple implementation? Currently, the major problem with my naive loop implementation is that it is very slow.

It can vectorize over a for-loop, yes. Just bear in mind that you can’t mix torch.autograd.grad with torch.func, so you’ll need to rewrite how your loss function is defined.

I’m sorry, do you mean it can actually parallelize the torch.autograd.grad computation? And I’d need to re-implement the loss function? Currently the loss is already computed by the network’s forward pass, and the gradient is w.r.t. a single layer in the network.

The way torch.func works (at least to my understanding) is that you define a function for a single sample (i.e. computing a derivative) and then compose it with torch.func.vmap to vectorize over an entire batch of samples. So all samples are computed in parallel instead of in a for-loop, as you might do with torch.autograd.grad.

So if you have a function that computes the derivative of the loss with respect to some inputs, torch.func can vectorize over this function for multiple inputs (alleviating the overhead of calling autograd multiple times instead of just once).
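As a concrete sketch of that pattern (the [1, H] layer weight and the squared loss here are placeholders for illustration, not the original poster's model), composing grad with vmap looks like this:

```python
import torch
from torch.func import grad, vmap

N, H = 4, 3                 # batch size and feature dim (illustrative)
x = torch.randn(N, H)       # batch of inputs
weight = torch.randn(1, H)  # a single layer weight

# Loss of ONE sample, written as an explicit function of the weight.
def per_sample_loss(w, xi):
    return ((xi @ w.t()).squeeze()) ** 2

# grad() differentiates w.r.t. the first argument (the weight);
# vmap() then vectorizes that over the batch dimension of x.
per_sample_grads = vmap(grad(per_sample_loss), in_dims=(None, 0))(weight, x)
print(per_sample_grads.shape)  # one gradient per sample: [N, 1, H]
```

Note `in_dims=(None, 0)`: the weight is shared (not batched), while the inputs are mapped over their first dimension, so you get an [N, 1, H] stack of per-sample gradients in a single call.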

Thanks for the suggestion. The current difficulty is that the function is implicitly computed in my model’s forward pass (unlike the example here, where the function is simply defined in a few lines). I just want to mimic torch.autograd.grad: pass in the loss and the weight, and let the existing computation graph be used to compute the gradient.

Thanks for the suggestion, I found the per-sample gradient example at Per-sample-gradients — functorch nightly documentation. It seems that the example ‘functionalizes’ a model, but I just want to compute the gradient w.r.t. a single layer and implement this part of the computation within the module’s forward function. I’m wondering if that is possible?
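You don’t necessarily have to functionalize the whole model: torch.func.functional_call lets you override just the one parameter you want gradients for, leaving every other parameter inside the module untouched. A minimal sketch with a made-up two-layer model (the layer name "2.weight" and the squared-error loss are assumptions for illustration):

```python
import torch
from torch.func import functional_call, grad, vmap

torch.manual_seed(0)
model = torch.nn.Sequential(
    torch.nn.Linear(3, 5),
    torch.nn.ReLU(),
    torch.nn.Linear(5, 1),
)
target_name = "2.weight"  # the single layer weight of interest (hypothetical)
target = dict(model.named_parameters())[target_name]

# Loss of ONE sample as a function of just that weight; all other
# parameters are taken from the module as-is.
def loss_for_sample(w, xi, yi):
    out = functional_call(model, {target_name: w}, (xi.unsqueeze(0),))
    return (out.squeeze() - yi) ** 2

x = torch.randn(4, 3)
y = torch.randn(4)
per_sample = vmap(grad(loss_for_sample), in_dims=(None, 0, 0))(target, x, y)
print(per_sample.shape)  # [N, 1, 5]: one [1, 5] gradient per sample
```

The key point is that the parameter dict passed to functional_call only needs to contain the parameters you are differentiating; this keeps the rewrite local to the one layer you care about.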

I’ll have a try, thank you. But I really think it would be good if torch.autograd.grad incorporated a per-sample gradient computation feature. It currently has is_grads_batched, but that seems not to be for this use case?

Yes, but that returns a gradient with the same shape as grad_persample = torch.autograd.grad(loss_batch[i], weight, retain_graph=True)[0], e.g. [4800, 1024] (the shape of the weight), though the values differ. I would expect the per-sample gradient to have shape [N, 4800, 1024], where N is the number of samples.

Yes, that might implicitly sum, but I’m not too sure at the moment. It can be used to compute per-sample gradients, but it can be quite limited. That’s why I recommended torch.func, as it’s much more efficient, but it does require rewriting some of your code away from the torch.autograd namespace.
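For completeness, here is roughly how is_grads_batched can be coaxed into per-sample gradients: pass an identity matrix as grad_outputs, so that each batched row selects exactly one sample of the loss vector. (The tiny linear layer below is a stand-in, not the original model.)

```python
import torch

N, H = 4, 3
x = torch.randn(N, H)
weight = torch.randn(1, H, requires_grad=True)
loss = (x @ weight.t()).squeeze(-1) ** 2  # per-sample loss, shape [N]

# Each row of eye(N) picks out one sample's loss, so the vmapped
# backward pass yields one gradient per sample: shape [N, 1, H].
per_sample_grads = torch.autograd.grad(
    loss, weight,
    grad_outputs=torch.eye(N),
    is_grads_batched=True,
    retain_graph=True,  # kept only so the graph can be reused afterwards
)[0]
print(per_sample_grads.shape)
```

This stays within torch.autograd, but as noted above it uses vmap under the hood and inherits its limitations, so the torch.func route is generally the more robust option.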