Understanding autograd's calculation of backprop

I am attempting to re-implement backpropagation on my own for didactic purposes, but am running into some issues. I am trying to work backwards from a simple network, starting with LogSoftmax + NLLLoss, but I am unable to match the gradient of the input to the LogSoftmax layer that autograd calculates.

```python
import torch
import torch.nn

new_relu_feats = torch.Tensor([[1, 0, 3]])
new_relu_feats.requires_grad = True
logits = torch.nn.LogSoftmax(dim=1)(new_relu_feats)
logits.retain_grad()  # keep the grad of this non-leaf tensor so it can be inspected
label = torch.LongTensor([1])
loss = torch.nn.NLLLoss()(logits, label)
loss.backward()

sm = torch.nn.Softmax(dim=1)(new_relu_feats)
dloss_dlogits = torch.Tensor([[0, -1, 0]])
logits.grad  # This matches dloss_dlogits above
dlogits_dnew_relu_feat = torch.Tensor([[[1-sm[0][0], -sm[0][0], -sm[0][0]],
                                        [-sm[0][1], 1-sm[0][1], -sm[0][1]],
                                        [-sm[0][2], -sm[0][2], 1-sm[0][2]]]])

dloss_dlogits * dlogits_dnew_relu_feat
# This does not entirely match new_relu_feats.grad, but the middle column does
# (corresponding to the correct class)
```

Is this correct, and is the matrix eventually reshaped for efficiency by simply selecting the portions that are non-zero (i.e. corresponding to the correct class)?


You can use triple backticks ``` before and after your code to have nicer formatting.

The combination of gradients is not an element-wise product but a matrix-matrix multiplication.
You can do `torch.bmm(dlogits_dnew_relu_feat, dloss_dlogits.unsqueeze(-1)).squeeze(-1)` to get what you want.
The unsqueeze/squeeze in the last dimension is just to have a dummy dimension of size 1 to make bmm happy 🙂
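
In case it is useful, here is a self-contained sketch that rebuilds the setup from your post (the Jacobian is assembled with `torch.eye` for brevity, but it is the same matrix you wrote out) and checks that the bmm result matches what autograd leaves in `new_relu_feats.grad` after `loss.backward()`:

```python
import torch

# Rebuild the setup from the original post
new_relu_feats = torch.tensor([[1., 0., 3.]], requires_grad=True)
logits = torch.nn.LogSoftmax(dim=1)(new_relu_feats)
loss = torch.nn.NLLLoss()(logits, torch.LongTensor([1]))
loss.backward()  # autograd's gradient ends up in new_relu_feats.grad

sm = torch.nn.Softmax(dim=1)(new_relu_feats.detach())[0]
dloss_dlogits = torch.tensor([[0., -1., 0.]])

# Same layout as dlogits_dnew_relu_feat in the post: entry [i, j] = delta_ij - sm[i]
dlogits_dnew_relu_feat = (torch.eye(3) - sm.unsqueeze(1)).unsqueeze(0)

manual_grad = torch.bmm(dlogits_dnew_relu_feat, dloss_dlogits.unsqueeze(-1)).squeeze(-1)
print(torch.allclose(manual_grad, new_relu_feats.grad))  # True
```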

Thank you so much! That did the trick and I can match these layers, along with a different example that includes a ReLU before the softmax. I am still a little unclear, conceptually, why it is dlogits_dnew_relu_feat @ dloss_dlogits rather than the other way round (dloss_dlogits @ dlogits_dnew_relu_feat).
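
In case anyone else is following along, here is roughly what that ReLU example looks like with the same bmm recipe (the input values and variable names here are just for illustration):

```python
import torch

# ReLU -> LogSoftmax -> NLLLoss, checked with the same bmm recipe as above
raw_feats = torch.tensor([[1., -2., 3.]], requires_grad=True)
relu_out = torch.nn.ReLU()(raw_feats)
logits = torch.nn.LogSoftmax(dim=1)(relu_out)
loss = torch.nn.NLLLoss()(logits, torch.LongTensor([1]))
loss.backward()

sm = torch.nn.Softmax(dim=1)(relu_out.detach())[0]
dloss_dlogits = torch.tensor([[0., -1., 0.]])
dlogits_drelu = (torch.eye(3) - sm.unsqueeze(1)).unsqueeze(0)     # same layout as dlogits_dnew_relu_feat above
drelu_draw = torch.diag((raw_feats[0] > 0).float()).unsqueeze(0)  # ReLU Jacobian: 0/1 diagonal mask

# Chain rule applied right-to-left with bmm
grad_relu = torch.bmm(dlogits_drelu, dloss_dlogits.unsqueeze(-1))
manual_grad = torch.bmm(drelu_draw, grad_relu).squeeze(-1)
print(torch.allclose(manual_grad, raw_feats.grad))  # True
```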

Ah, I see now. The derivative for LogSoftmax was transposed from how I was thinking about it conceptually:

```python
dlogits_dnew_relu_feat = torch.Tensor([[[1-sm[0][0], -sm[0][0], -sm[0][0]],
                                        [-sm[0][1], 1-sm[0][1], -sm[0][1]],
                                        [-sm[0][2], -sm[0][2], 1-sm[0][2]]]])
```

should be

```python
dlogits_dnew_relu_feat = torch.Tensor([[[1-sm[0][0], -sm[0][1], -sm[0][2]],
                                        [-sm[0][0], 1-sm[0][1], -sm[0][2]],
                                        [-sm[0][0], -sm[0][1], 1-sm[0][2]]]])
```

and now you can do `dloss_dlogits @ dlogits_dnew_relu_feat` with no need for squeeze/unsqueeze.
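
A quick way to sanity-check this orientation, assuming your PyTorch version has `torch.autograd.functional.jacobian`, is to compare against the Jacobian autograd itself computes:

```python
import torch

new_relu_feats = torch.tensor([[1., 0., 3.]])
sm = torch.nn.Softmax(dim=1)(new_relu_feats)[0]

# Corrected Jacobian: row i, column j holds d logits_i / d input_j = delta_ij - softmax_j
dlogits_dnew_relu_feat = torch.eye(3) - sm.unsqueeze(0)

# Autograd's Jacobian of a [1, 3] -> [1, 3] function has shape [1, 3, 1, 3]
autograd_jac = torch.autograd.functional.jacobian(
    lambda x: torch.nn.LogSoftmax(dim=1)(x), new_relu_feats).reshape(3, 3)
print(torch.allclose(dlogits_dnew_relu_feat, autograd_jac))  # True

# The backward pass is then a vector-Jacobian product
dloss_dlogits = torch.tensor([[0., -1., 0.]])
print(dloss_dlogits @ dlogits_dnew_relu_feat)  # matches new_relu_feats.grad from the first snippet
```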
