SelectBackward0 vs AddmmBackward0


When I pass a batch of inputs through the model, o = model(x), and print o.grad_fn, I get AddmmBackward0.
However, when I take the output for a single sample, for example o[1].grad_fn, I get SelectBackward0.
Why is this?

When I use a DataLoader with batch_size=1, I get AddmmBackward0.

Anyways, down the line I have this issue:

>>>o[i] # got this from calling 
tensor([-2.0692,  2.0274], grad_fn=<SelectBackward0>)

# got this from data loader
#for i, data in enumerate(unknown_dataloader):
#  inputs, labels = data
#  outputs = model(inputs)
tensor([[-2.0692,  2.0274]], grad_fn=<AddmmBackward0>)

If I call self.criterion(o[i], labels), I get an error: RuntimeError: size mismatch (got input: [2], target: [1])

How would I fix this for all o? I don’t want to use a dataloader to run the entire inputs in batch sizes of 1.

You are seeing SelectBackward0 because you are indexing/selecting the output via o[0], which is itself a differentiable operation, and are then checking the .grad_fn attribute of the indexed tensor.

You would need to explain the use case a bit more, i.e. which criterion is used, what the output and target shapes are expected to be, why you are indexing the output etc.
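To illustrate the indexing point, here is a minimal sketch. It assumes a toy nn.Linear(4, 2) standing in for the real model, random inputs, and made-up labels; none of these come from the original post. Indexing with o[1] drops the batch dimension, so the criterion sees an input of shape [2] against a target of shape [1]. Restoring the batch dimension with unsqueeze(0) is one way to make the shapes line up again.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-in for the real model (assumption): 4 input features, 2 classes.
model = nn.Linear(4, 2)
criterion = nn.CrossEntropyLoss()

x = torch.randn(3, 4)             # batch of 3 samples
labels = torch.tensor([0, 1, 1])  # made-up class indices

o = model(x)                      # shape [3, 2]
row = o[1]                        # shape [2], batch dimension dropped by indexing
row_batched = o[1].unsqueeze(0)   # shape [1, 2], batch dimension restored

print(type(o.grad_fn).__name__)    # grad_fn of the layer's forward op
print(type(row.grad_fn).__name__)  # grad_fn of the indexing op

# labels[1:2] keeps the target at shape [1], matching the [1, 2] input.
loss = criterion(row_batched, labels[1:2])
loss.backward()
```

Slicing with o[1:2] instead of o[1].unsqueeze(0) would also keep the batch dimension.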

I’m using SoftMax & CrossEntropy, this is a classification problem.

I’m implementing a novel approach that uses the values of backpropagation on a trained model, without optimizing. In essence, I want to save the backpropagation gradient values for every input given to me for this task.

for i, data in enumerate(some_dataloader):  # dataloader batch size is 1
    inputs, labels = data
    outputs = model(inputs)
    loss = criterion(outputs, labels)  # torch.nn.CrossEntropyLoss()

    # backpropagation only, no optimizer step
    model.zero_grad()
    loss.backward()

    # get the gradient of some layer
    grad = some_layer.bias.grad  # I use this later

Before this loop runs, I have already passed all the data I need through the model (o = model(data)), so I want to reuse o[i] instead of recomputing the output inside the loop.

While we are on the topic, is there a way to make this work per sample with a larger batch size? For example, loss = criterion(outputs, labels) with batch size 32 would give 32 gradients rather than 1?

For a per-sample gradient you might want to check functorch and this tutorial which explains the naive and the optimized approach.
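As a rough sketch of the vectorized approach, the snippet below uses torch.func, which is where the functorch transforms live in PyTorch 2.x (an assumption about the installed version). The toy nn.Linear(4, 2) model, the random inputs, and the helper name sample_loss are placeholders for illustration, not part of the original post.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.func import functional_call, grad, vmap

torch.manual_seed(0)

# Toy stand-in for the trained model (assumption): 4 features, 2 classes.
model = nn.Linear(4, 2)
inputs = torch.randn(32, 4)
labels = torch.randint(0, 2, (32,))

# Detached copies of the parameters, passed explicitly to the functional call.
params = {name: p.detach() for name, p in model.named_parameters()}

def sample_loss(params, x, y):
    # vmap passes one sample at a time, so re-add the batch dimension.
    logits = functional_call(model, params, (x.unsqueeze(0),))
    return F.cross_entropy(logits, y.unsqueeze(0))

# grad differentiates w.r.t. the first argument (params);
# vmap maps over dim 0 of inputs and labels while sharing params.
per_sample_grads = vmap(grad(sample_loss), in_dims=(None, 0, 0))(params, inputs, labels)

# Each entry has a leading dimension of 32: one gradient per sample.
print({name: g.shape for name, g in per_sample_grads.items()})
```

The result is a dict keyed by parameter name, so per_sample_grads["bias"][i] would be the bias gradient for sample i.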

Would you know of a solution to the first part of my problem? Even if I calculate the gradients efficiently, I would still need to pass the entire test data twice.

Calling backward on slices should work as seen here:

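import torch
import torch.nn as nn

model = nn.Linear(10, 10)
x = torch.arange(5).unsqueeze(1).repeat(1, 10).float()

# one-by-one: separate forward and backward pass per sample
grads1 = []
for x_ in x:
    x_ = x_.unsqueeze(0)
    out = model(x_)
    model.zero_grad()
    out.mean().backward()
    g = model.weight.grad.clone()
    grads1.append(g)
grads1 = torch.stack(grads1)

# batched forward, sample backward
out = model(x)

grads2 = []
for o in out[:-1]:
    model.zero_grad()
    # keep the graph alive, since later slices share it
    o.mean().backward(retain_graph=True)
    g = model.weight.grad.clone()
    grads2.append(g)
model.zero_grad()
out[-1].mean().backward() # clear computation graph
g = model.weight.grad.clone()
grads2.append(g)
grads2 = torch.stack(grads2)

print((grads1 - grads2).abs().max())
# tensor(0.)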

and I think it depends on your use case which approach would be the most efficient and most convenient one.
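Applied to the setup from the question, the same idea might look like the sketch below. The small nn.Sequential model, the random data, and the labels are made up for illustration; some_layer and criterion follow the names used earlier in the thread. It uses torch.autograd.grad instead of .backward(), which returns the gradient directly rather than accumulating it into .grad, so no zero_grad call is needed.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-in for the trained model (assumption, not from the thread).
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))
some_layer = model[0]
criterion = nn.CrossEntropyLoss()

data = torch.randn(5, 4)
labels = torch.tensor([0, 1, 1, 0, 1])

o = model(data)  # single batched forward pass, reused below

bias_grads = []
for i in range(o.size(0)):
    # unsqueeze(0) restores the batch dimension expected by the criterion
    loss = criterion(o[i].unsqueeze(0), labels[i].unsqueeze(0))
    # retain_graph=True keeps the shared graph alive for the remaining samples
    (g,) = torch.autograd.grad(loss, some_layer.bias, retain_graph=True)
    bias_grads.append(g)
bias_grads = torch.stack(bias_grads)  # one bias gradient per sample

print(bias_grads.shape)
```

Retaining the graph for every sample keeps the activations of the whole batch in memory, so for large inputs it may still make sense to run this in chunks.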
