The gradients of variables/parameters that come before a softmax become zero.

For example:

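import torch
import torch.nn.functional as F  # needed for F.softmax

a1 = torch.rand([4, 4], requires_grad=True)

b1 = a1.squeeze(0)**2  # squeeze(0) is a no-op here, since dim 0 has size 4
b1 = F.softmax(b1, dim=1)  # softmax over dim=1

b1.sum().backward()
print(a1.grad)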

Gives

tensor([[0., 0., 0., 0.],
        [0., 0., 0., 0.],
        [0., 0., 0., 0.],
        [0., 0., 0., 0.]])

How do I deal with this if I need to use softmax between layers in my model?

Thanks

I am not sure exactly what you want to do here, but the result is mathematically correct.

You are taking a 4 x 4 matrix (a1) and squaring each value. Then you are taking the softmax along dim=1, which normalizes each row so that its values **sum to 1**. When you then sum this matrix, the result is always 4 (four rows, each summing to 1).
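Written out, with $a_{ij}$ denoting the entries of a1 (notation introduced here only for illustration), the summed output is:

```latex
\sum_{i=1}^{4}\sum_{j=1}^{4}
  \frac{e^{a_{ij}^{2}}}{\sum_{k=1}^{4} e^{a_{ik}^{2}}}
  = \sum_{i=1}^{4} 1
  = 4
```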

This means that however you change the values of a1, the sum remains 4. You are essentially taking the derivative of a constant, which is 0.
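As a rough check of this explanation, here is a minimal sketch that reuses the squared-input softmax from the question but feeds it into a non-constant loss. The weight tensor `w` and the seed are assumptions made up for illustration; `w` just stands in for whatever layer would follow the softmax in a real model.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)  # arbitrary seed, only for reproducibility

a1 = torch.rand(4, 4, requires_grad=True)
probs = F.softmax(a1 ** 2, dim=1)

# Each row of the softmax output sums to 1, so this total is 4 for any a1.
total = probs.sum()

# Hypothetical downstream weights standing in for a following layer.
w = torch.arange(16, dtype=torch.float32).reshape(4, 4)

# This loss depends on how probability is distributed within each row,
# so it is no longer constant with respect to a1.
loss = (probs * w).sum()
loss.backward()

grad_is_nonzero = bool(a1.grad.abs().sum() > 0)
print(total.item(), grad_is_nonzero)
```

So softmax between layers should not be a problem in itself: in a model, its output usually passes through further layers and a loss that is not a plain sum of the probabilities, and the gradients are then generally nonzero.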