Why Parameters's Gradient is Zero?

Hi,
In my model , some parameters’s gradient keeps as zero, which confused me a lot. Some necessary information list as below:
The main code about the parameters of zero-gradient:

###
some code
###
# about attention
alpha = F.softmax((bag_embs * self.att_a).mm(self.query_vec), 1)
bag_embs = (bag_embs * alpha).sum(0)
batch_fea.append(bag_embs)
###
some code
###

in the model definition:

self.bias_d = nn.Parameter(torch.randn(3,5))
self.query_vec = nn.Parameter(torch.randn(3,1)
self.att_a = nn.Parameter(torch.randn(3)

in the model weight init:

nn.init.xavier_uniform(self.query_vec)
nn.init.uniform(self.att_a)

And I use the Tensorboad to visualize the model parameters and its gradient:

The gradient of att_a and query_vec is always zero. while others such as bias_d is ok.
So What may cause that problem?

Thanks

1 Like

Hi,

You code sample does not show how bias_d is used so not sure how to answer.
That being said, in what you showed, if alpha is 0, then both these gradients will be 0.

I guess a good way to debug this is to print the different values during the forward to make sure they are what you expect.

Thanks for your reply. The code sample is a part of the whole model. bias_d is just a bias in one fully connected layer. And bias_d can be updated normally. But the two vectors att_a, query_vec can not be.

alpha is the attention weight , so it cannot be zero. the other parameters in the model are updated normally except for the two att-a, query_vec. It’s very strange… :joy:

Hi, I found out the problem:joy: the dim in F.softmax() should be 0 rather than 1. After changing it to 0, the gradient is normal.
Because (bag_embs * self.att_a).mm(self.query_vec) get a Variable with size K * 1 and the softmax should be applied in the first dim.
If F.softmax(..., 1), then the values in alpha are all 1.
BTW, why the gradient will become 0 when alpha is all 1.

Good that you found it !

The answer is that a softmax with a single output is a constant function that outputs 1 all the time. And the gradient of a constant function is 0.

Yeah.
Thanks very much.