I am attempting to re-implement backpropagation on my own for didactic purposes, but am running into some issues. I am trying to work backwards from a simple network, starting with LogSoftmax + NLLLoss, but my manually computed gradient with respect to the input of the LogSoftmax layer does not match the one calculated by autograd.
import torch
import torch.nn
new_relu_feats = torch.Tensor([[1,0,3]])
new_relu_feats.requires_grad = True
logits = torch.nn.LogSoftmax(dim=1)(new_relu_feats)
logits.retain_grad()
label = torch.LongTensor([1])
loss = torch.nn.NLLLoss()(logits, label)
loss.backward()
sm = torch.nn.Softmax(dim=1)(new_relu_feats)
dloss_dlogits = torch.Tensor([[0,-1,0]])
logits.grad # This matches above
dlogits_dnew_relu_feat = torch.Tensor([[[1-sm[0][0], -sm[0][0], -sm[0][0]], [-sm[0][1], 1-sm[0][1], -sm[0][1]], [-sm[0][2], -sm[0][2], 1-sm[0][2]]]])
dloss_dlogits * dlogits_dnew_relu_feat
# This does not entirely match above, but the middle column does (corresponding to the correct class)
new_relu_feats.grad
Is my calculation correct, with the matrix simply being reduced at the end for efficiency by selecting only the non-zero portions (i.e., those corresponding to the correct class)?
You can use triple backticks ``` before and after your code to have nicer formatting.
The gradients are combined not with an element-wise product but with a matrix multiplication.
You can do torch.bmm(dlogits_dnew_relu_feat, dloss_dlogits.unsqueeze(-1)).squeeze(-1) to get what you want.
The unsqueeze/squeeze on the last dimension just adds and then removes a dummy dimension of size 1, since bmm expects 3-D inputs.
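As a minimal sketch, the snippet below puts the pieces together and checks the bmm result against autograd. It reuses the variable names from the question; `manual_grad` is a name introduced here for illustration, and the matrix is built with `torch.eye` instead of being typed out entry by entry, which is assumed to be equivalent to the hand-written version.

```python
import torch

# Same setup as in the question.
new_relu_feats = torch.tensor([[1.0, 0.0, 3.0]], requires_grad=True)
logits = torch.nn.LogSoftmax(dim=1)(new_relu_feats)
label = torch.tensor([1])
loss = torch.nn.NLLLoss()(logits, label)
loss.backward()

# Softmax probabilities, detached so they are plain values. Shape (1, 3).
sm = torch.softmax(new_relu_feats.detach(), dim=1)

# Gradient of the loss w.r.t. the log-probabilities: -1 at the target class.
dloss_dlogits = torch.tensor([[0.0, -1.0, 0.0]])

# Same matrix as in the question, built compactly:
# entry [b, j, i] = d logits[b, i] / d new_relu_feats[b, j] = delta_ji - sm[b, j]
dlogits_dnew_relu_feat = torch.eye(3).unsqueeze(0) - sm.unsqueeze(-1)

# Batched matrix-vector product instead of an element-wise product.
manual_grad = torch.bmm(
    dlogits_dnew_relu_feat, dloss_dlogits.unsqueeze(-1)
).squeeze(-1)

print(manual_grad)
print(new_relu_feats.grad)
print(torch.allclose(manual_grad, new_relu_feats.grad))
```

Both tensors should equal the softmax output with 1 subtracted at the target class, which is the familiar combined gradient of LogSoftmax + NLLLoss.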
Thank you so much! That did the trick and I can now match these layers, along with a different example that includes a ReLU before the softmax. I am still a little unclear on why, conceptually, it is dlogits_dnew_relu_feat @ dloss_dlogits rather than the other way around (dloss_dlogits @ dlogits_dnew_relu_feat).
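One way to look at the ordering, offered as a sketch rather than a definitive answer: the chain rule gives dloss/dinput[j] = sum over i of dloss/dlogits[i] * dlogits[i]/dinput[j]. With the usual Jacobian convention J[i, j] = dlogits[i]/dinput[j], that sum is the row vector dloss_dlogits times J. The matrix written out in the question has its rows indexed by the input and its columns by the output, so it is the transpose of J, and the same sum then becomes that matrix times dloss_dlogits. The check below uses `torch.autograd.functional.jacobian`; the names `x`, `g`, `J`, and `M` are introduced here for illustration and drop the batch dimension for brevity.

```python
import torch

x = torch.tensor([1.0, 0.0, 3.0])
g = torch.tensor([0.0, -1.0, 0.0])  # dloss/dlogits for label 1

# Standard Jacobian: J[i, j] = d log_softmax(x)[i] / d x[j]
J = torch.autograd.functional.jacobian(
    lambda v: torch.log_softmax(v, dim=0), x
)

# Matrix from the question (without the batch dimension):
# M[j, i] = delta_ji - softmax(x)[j]
sm = torch.softmax(x, dim=0)
M = torch.eye(3) - sm.unsqueeze(-1)

# The hand-written matrix is the transposed Jacobian.
print(torch.allclose(M, J.T))

# Row vector times Jacobian equals transposed Jacobian times column vector.
print(torch.allclose(g @ J, M @ g))
```

So both orderings express the same chain rule; which one applies depends only on whether the matrix is stored as the Jacobian or as its transpose.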