Efficient way to separate each sample's gradient contribution

Hi everyone,
I have a classification problem that take an image dataset in input (Cifar10 for example); given a generic batch of the train_loader, which is the most efficient way to get back one gradient vector associated to each class in the batch (or to each sample of the batch)?
I can do this, for example, by iterating over the samples of the batch and computing the following steps:

model.zero_grad()                               # reset gradients left over from the previous sample
output = model(data[i].unsqueeze(0))            # forward pass on a single sample (re-add the batch dimension)
loss = criterion(output, label[i].unsqueeze(0))
loss.backward()                                 # gradients for sample i are now in each p.grad

(where i is an index over the samples of the batch) and finally copy the gradient associated with that specific sample into a second variable.
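
For completeness, the copy step I mean is something like this minimal sketch, continuing the loop above (flattening into a single vector is optional, but matches the "one gradient vector per sample" I described):

# after loss.backward() for sample i, snapshot its gradients before
# the next iteration overwrites them
grads_i = [p.grad.detach().clone() for p in model.parameters()]
# optionally flatten them into one gradient vector for this sample
grad_vec_i = torch.cat([g.reshape(-1) for g in grads_i])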

This approach seems very inefficient, since the samples of the batch are processed sequentially, one by one, rather than in parallel. Is it possible to do this in a more clever way?
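
In case it helps frame what I am after: from what I have read, torch.func in PyTorch >= 2.0 should be able to vectorize exactly this computation by composing grad with vmap. A rough sketch of what I am hoping for (reusing model, criterion, data and label from my loop above; I have not verified that this is the idiomatic way):

from torch.func import functional_call, vmap, grad

# detached copies of the model's parameters and buffers, used as
# explicit inputs to a functional version of the forward pass
params = {k: v.detach() for k, v in model.named_parameters()}
buffers = {k: v.detach() for k, v in model.named_buffers()}

def sample_loss(params, buffers, sample, target):
    # vmap strips the batch dimension, so add it back for the forward pass
    output = functional_call(model, (params, buffers), (sample.unsqueeze(0),))
    return criterion(output, target.unsqueeze(0))

# grad() differentiates w.r.t. the first argument (params);
# vmap() vectorizes that computation over the batch dimension of data/label
per_sample_grads = vmap(grad(sample_loss), in_dims=(None, None, 0, 0))(
    params, buffers, data, label
)
# per_sample_grads[name] has shape (N, *param.shape): one gradient per sample

Is something like this the recommended approach, or is there a better way?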