Understanding autograd calculation of backprop

JakeStevens · November 7, 2019, 6:51pm

I am attempting to re-implement backpropagation on my own for didactic purposes, but am running into some issues. I am trying to work backwards through a simple network, starting with LogSoftmax + NLLLoss, but my hand-computed gradient with respect to the input of the LogSoftmax layer does not match the one autograd produces.

import torch
import torch.nn

new_relu_feats = torch.Tensor([[1,0,3]])
new_relu_feats.requires_grad = True
logits = torch.nn.LogSoftmax(dim=1)(new_relu_feats)
logits.retain_grad()
label = torch.LongTensor([1])
loss = torch.nn.NLLLoss()(logits, label)
loss.backward()

# Manual backward pass
sm = torch.nn.Softmax(dim=1)(new_relu_feats)
dloss_dlogits = torch.Tensor([[0,-1,0]])
logits.grad # This matches dloss_dlogits
dlogits_dnew_relu_feat = torch.Tensor([[[1-sm[0][0], -sm[0][0], -sm[0][0]], [-sm[0][1], 1-sm[0][1], -sm[0][1]], [-sm[0][2], -sm[0][2], 1-sm[0][2]]]])

# Element-wise product: this does not entirely match new_relu_feats.grad,
# but the middle column does (corresponding to the correct class)
dloss_dlogits * dlogits_dnew_relu_feat
new_relu_feats.grad

Is this correct, with the matrix eventually being reduced for efficiency by simply selecting the portions that are non-zero (i.e. those corresponding to the correct class)?

albanD · November 7, 2019, 7:18pm

Hi,

You can use triple backticks ``` before and after your code to have nicer formatting.

The combination of gradients is not an element-wise product but a matrix-matrix multiplication.
You can do torch.bmm(dlogits_dnew_relu_feat, dloss_dlogits.unsqueeze(-1)).squeeze(-1) to get what you want.
The unsqueeze/squeeze on the last dimension is just there to add a dummy dimension of size 1 to make bmm happy.
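
As a rough check of the suggestion above, here is a minimal sketch that rebuilds the setup from the question and compares the bmm result against autograd. The names x, log_probs, dlogits_dx, manual and matches are placeholders introduced for this sketch, and the matrix is built with torch.eye as a shortcut that should reproduce the entries posted in the question.

```python
import torch

# Same setup as in the question: one sample, three classes, label 1.
x = torch.tensor([[1.0, 0.0, 3.0]], requires_grad=True)
log_probs = torch.log_softmax(x, dim=1)
loss = torch.nn.functional.nll_loss(log_probs, torch.tensor([1]))
loss.backward()

sm = torch.softmax(x.detach(), dim=1)              # shape (1, 3)
dloss_dlogits = torch.tensor([[0.0, -1.0, 0.0]])   # shape (1, 3)

# Matrix as posted in the question (assumption: used unchanged):
# entry [i][j] = (1 if i == j else 0) - sm[0][i], with shape (1, 3, 3).
dlogits_dx = (torch.eye(3) - sm[0].unsqueeze(1)).unsqueeze(0)

# bmm needs 3-D operands, hence the dummy trailing dimension on the vector.
manual = torch.bmm(dlogits_dx, dloss_dlogits.unsqueeze(-1)).squeeze(-1)

matches = torch.allclose(manual, x.grad, atol=1e-6)
print(manual)
print(x.grad)
print(matches)
```

The element-wise product in the question only scales each column of the matrix; bmm additionally sums over the logits dimension, which is the summation the chain rule calls for.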

JakeStevens · November 7, 2019, 7:43pm

Thank you so much! That did the trick: I can now match these layers, as well as a different example that includes a ReLU before the softmax. From a conceptual point of view, though, I am still a little unclear why it is dlogits_dnew_relu_feat @ dloss_dlogits rather than the other way round (dloss_dlogits @ dlogits_dnew_relu_feat).
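
One way to see where the ordering comes from is to write the chain rule out by index. This is only a sketch, in notation that does not appear in the thread: x stands for new_relu_feats, y for logits, s for softmax(x), g for dloss_dlogits, c for the label index, and J for the Jacobian of LogSoftmax.

$$
y_i = x_i - \log \sum_k e^{x_k},
\qquad
J_{ij} = \frac{\partial y_i}{\partial x_j} = \delta_{ij} - s_j
$$

$$
\frac{\partial L}{\partial x_j}
= \sum_i g_i \, J_{ij}
= g_j - s_j \sum_i g_i ,
\qquad\text{i.e.}\qquad
\nabla_x L = g\,J = \left(J^{\top} g^{\top}\right)^{\top}
$$

$$
g_i = -\delta_{ic}
\;\Longrightarrow\;
\frac{\partial L}{\partial x_j} = s_j - \delta_{jc}
$$

The sum runs over the output index i, so with J laid out as above the row vector g goes on the left. A matrix with entries \delta_{ij} - s_i is J^T, and that one has to be multiplied from the left onto g as a column vector instead.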

JakeStevens · November 7, 2019, 8:04pm

Ah, I see now. The derivative I wrote for LogSoftmax was the transpose of how I was thinking about it conceptually:

dlogits_dnew_relu_feat = torch.Tensor([[[1-sm[0][0], -sm[0][0], -sm[0][0]], [-sm[0][1], 1-sm[0][1], -sm[0][1]], [-sm[0][2], -sm[0][2], 1-sm[0][2]]]])

should be

dlogits_dnew_relu_feat = torch.Tensor([[[1-sm[0][0], -sm[0][1], -sm[0][2]], [-sm[0][0], 1-sm[0][1], -sm[0][2]], [-sm[0][0], -sm[0][1], 1-sm[0][2]]]])

With that, dloss_dlogits @ dlogits_dnew_relu_feat works directly, with no need for squeeze/unsqueeze.
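
For anyone wanting to double-check the corrected matrix, the sketch below compares it with the Jacobian that torch.autograd.functional.jacobian reports for log_softmax, and then compares the @ product with autograd's gradient. The names x, label, dlogits_dx, auto_jac, manual and closed_form are placeholders chosen for this sketch, and the closed-form line assumes NLLLoss with its default mean reduction on a single sample.

```python
import torch
from torch.autograd.functional import jacobian

# Same setup as in the thread: one sample, three classes, label 1.
x = torch.tensor([[1.0, 0.0, 3.0]], requires_grad=True)
label = torch.tensor([1])
loss = torch.nn.functional.nll_loss(torch.log_softmax(x, dim=1), label)
loss.backward()

sm = torch.softmax(x.detach(), dim=1)              # shape (1, 3)
dloss_dlogits = torch.tensor([[0.0, -1.0, 0.0]])   # shape (1, 3)

# Corrected matrix: entry [i][j] = (1 if i == j else 0) - sm[0][j].
dlogits_dx = (torch.eye(3) - sm[0].unsqueeze(0)).unsqueeze(0)  # (1, 3, 3)

# Jacobian of log_softmax for the single sample, as computed by autograd.
auto_jac = jacobian(lambda v: torch.log_softmax(v, dim=0), x.detach()[0])

# Row vector times Jacobian. matmul broadcasts the batch dimension,
# so the result has shape (1, 1, 3) rather than (1, 3).
manual = dloss_dlogits @ dlogits_dx

# Closed form for LogSoftmax + NLLLoss: softmax minus one-hot label.
closed_form = sm - torch.nn.functional.one_hot(label, num_classes=3)

print(torch.allclose(auto_jac, dlogits_dx[0], atol=1e-6))
print(torch.allclose(manual, x.grad, atol=1e-6))
print(torch.allclose(closed_form, x.grad, atol=1e-6))
```

Note that the product carries an extra leading dimension of size 1, so a squeeze(0) (or comparing via broadcasting, as above) is still needed if an exact (1, 3) shape matters.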