Model weights not being updated

Hi all. I'm running into a similar problem. My bug is probably that I'm using the wrong combination of softmax and loss function, so the gradient values are extremely small.

For me:
My model doesn't seem to be training.
Checking a = list(model.parameters())[0].clone() before and b = list(model.parameters())[0].clone() after the call to loss.backward() and optimizer.step(), a == b returns False.
Printing list(model.parameters())[0].grad returns a matrix of extremely small numbers, on the order of 10^-8.
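
In case it helps anyone else, here is a minimal sketch of that check; model, criterion, optimizer, inputs, and targets are placeholders for your own setup:

import torch

# snapshot the parameters before the update
before = [p.detach().clone() for p in model.parameters()]

optimizer.zero_grad()
loss = criterion(model(inputs), targets)
loss.backward()

# inspect gradient magnitudes before the optimizer step
for name, p in model.named_parameters():
    if p.grad is None:
        print(name, "has no gradient")
    else:
        print(name, "grad norm:", p.grad.norm().item())

optimizer.step()

# compare parameters before and after the step
for b, p in zip(before, model.parameters()):
    print("changed:", not torch.equal(b, p.detach()))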


Hello There!

What do you mean by the following?

I don’t understand why the gradients are super small in your case.

Hi, I am also facing the same problem, but in my case list(model.parameters())[0].grad is None. How can I find the mistake? Any suggestions?

Thanks

In my case, it was because the softmax produced a zero-valued tensor, so the NLL loss was unchanging and the weights were not being updated.
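
For reference, a minimal sketch of the safer combination (the tensor shapes here are made up): nn.NLLLoss expects log-probabilities, so it should be paired with nn.LogSoftmax rather than a plain softmax whose output can underflow to zero and then produce -inf after a log.

import torch
import torch.nn as nn

logits = torch.randn(4, 10, requires_grad=True)  # e.g. batch of 4, 10 classes
targets = torch.tensor([1, 0, 3, 9])

# problematic: softmax can underflow to exactly 0, and log(0) = -inf
probs = nn.Softmax(dim=1)(logits)
bad_loss = nn.NLLLoss()(torch.log(probs), targets)

# safer: log-softmax is computed in a numerically stable way
log_probs = nn.LogSoftmax(dim=1)(logits)
good_loss = nn.NLLLoss()(log_probs, targets)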


I am also dealing with the same problem. I get p.grad is None for all the parameters in the network. How do I check if the computation graph is broken somewhere?
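
One rough way to narrow this down (a sketch, not specific to any model in this thread; model, inputs, criterion, and targets are placeholders) is to check requires_grad on the parameters, the grad_fn of the output, and which parameters still have grad=None after backward:

# check that the parameters are trainable
for name, p in model.named_parameters():
    if not p.requires_grad:
        print(name, "does not require grad")

out = model(inputs)
print("output grad_fn:", out.grad_fn)  # None means the graph is broken before this point

loss = criterion(out, targets)
loss.backward()

# parameters whose grad is still None were never reached by backprop
for name, p in model.named_parameters():
    if p.grad is None:
        print(name, "received no gradient")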

@Gkv @RGaonkar I have the same problem here.
My weights were never updated, list(model.parameters())[0].grad was None, and my network's output tensor had no grad_fn.
It turned out I had to return the function call directly instead of assigning it to a variable first. I don't know whether that is the correct explanation, but it worked for me.
Buggy source code:

import torch.nn as nn

class Model(nn.Module):
    def __init__(self, size_in, size_out):
        super(Model, self).__init__()
        self.ReLU1 = nn.ReLU()
        self.conv1 = nn.Conv2d(in_channels=size_in, out_channels=size_out,
                               kernel_size=(3, 3),
                               stride=(1, 1),
                               padding=(1, 1),
                               dilation=1,
                               groups=1,
                               bias=False)

    def forward(self, x):
        x = self.ReLU1(x)
        x = self.conv1(x)
        return x

You guys could try this:

class Model(nn.Module):
    def __init__(self, size_in, size_out):
        super(Model, self).__init__()
        self.ReLU1 = nn.ReLU()
        self.conv1 = nn.Conv2d(in_channels=size_in, out_channels=size_out,
                               kernel_size=(3, 3),
                               stride=(1, 1),
                               padding=(1, 1),
                               dilation=1,
                               groups=1,
                               bias=False)

    def forward(self, x):
        x = self.ReLU1(x)
        return self.conv1(x)  # return the result directly instead of assigning to x first
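
If it helps, a quick way to confirm whether either version actually detaches the graph (a small sketch with made-up sizes, using the Model class above):

import torch

model = Model(size_in=3, size_out=8)   # sizes are placeholders
x = torch.randn(1, 3, 32, 32)

out = model(x)
print(out.grad_fn)   # a non-None grad_fn means the graph is intact

out.sum().backward()
print(model.conv1.weight.grad.norm())  # should be finite and usually nonzero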

What do you mean by:
verify that you are not setting requires_grad=True to all parameters of your network, as it would avoid backprop through the network


Ignore this message. The problem was that I had too shallow a network! Sorry about the noise! It converged after I added a couple of conv layers.

I am having a very similar issue. Initially, I had 3 conv+relu+maxpool blocks followed by 2 linear layers with ReLU and a sigmoid for binary classification. It would not learn: all the parameters would end up very close to 0, and the output would be "random". I reduced the network to conv/relu + linear/relu + linear/sigmoid and got the same problem.


I faced a similar problem: I was applying ReLU on the last layer, then softmax, and then cross-entropy loss. Make sure the last layer does not have any activation function if you are using softmax with cross entropy; nn.CrossEntropyLoss already applies log-softmax internally, so it expects raw logits.
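
A minimal sketch of the two variants (shapes and names are placeholders):

import torch
import torch.nn as nn

logits = torch.randn(8, 5, requires_grad=True)   # raw outputs of the last Linear layer
targets = torch.randint(0, 5, (8,))
criterion = nn.CrossEntropyLoss()

# correct: feed raw logits, no ReLU/softmax on the last layer
loss = criterion(logits, targets)

# problematic: an extra softmax squashes the logits and shrinks the gradients
loss_squashed = criterion(torch.softmax(logits, dim=1), targets)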


Hi! I've had the same problem: the loss just stayed the same.
This is my model:

class Model(nn.Module):
    def __init__(self, input_size, hidden_size, output_size, n_layers):
        super().__init__()

        self.hidden_size = hidden_size
        self.n_layers = n_layers
        self.lstm = torch.nn.LSTM(input_size=input_size,
                                  hidden_size=hidden_size,
                                  num_layers=n_layers,
                                  dropout=0.3,
                                  batch_first=True)

        self.out = nn.Linear(hidden_size, output_size)
        self.relu = nn.ReLU()

    def forward(self, input_arr, hidden):

        output, (hidden_state, cell_state) = self.lstm(input_arr, hidden)
        output = self.out(output)

        # output = self.relu(output)  <-- Here was the problem

        return output, (hidden_state, cell_state)

I think that a model with random weights predicts a lot of negative values, but ReLU cuts them off at zero, so MSELoss has nothing to work with (the gradient through the clipped outputs is zero) and the weights aren't updated.

But when I removed the ReLU activation, it was just fine.
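
A tiny, purely illustrative sketch of that effect: when the pre-activation is negative, ReLU outputs zero and its gradient is zero, so no signal reaches the weights through those units.

import torch
import torch.nn as nn

pred = torch.tensor([-2.0, -0.5, -1.3], requires_grad=True)  # all negative, as with a random init
target = torch.tensor([1.0, 2.0, 3.0])

loss = nn.MSELoss()(torch.relu(pred), target)
loss.backward()
print(pred.grad)  # tensor([0., 0., 0.]) -- ReLU blocks the gradient entirely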

P.S. Thanks to everybody who contributed to this discussion. It helped me a lot!


Thank you very much, sir. The hint was really helpful.

I am facing a similar problem: the loss of my model is not changing.
My model definition:
class DocLSTM(nn.Module):
    def __init__(self, vocab_size, in_dim, mem_dim, sparsity, freeze):
        super(DocLSTM, self).__init__()
        self.emb = nn.Embedding(vocab_size, in_dim, padding_idx=Constants.PAD, sparse=sparsity)
        if freeze:
            self.emb.weight.requires_grad = False
        self.body_LSTM = nn.LSTM(150, 150, 1)
        self.Para_LSTM = nn.LSTM(150, 150, 1)
        self.Headline_LSTM = nn.LSTM(300, 150, 1)
        self.childsumtreelstm = ChildSumTreeLSTM(in_dim, mem_dim)
        torch.manual_seed(0)
        self.sent_pad = torch.randn(1, 150)
        self.para_pad = torch.randn(1, 1, 150)
        self.word_pad = torch.randn(1, 300)

When I run the following after loss.backward():

list(self.model.parameters())[0].grad
list(self.doclstm.parameters())[0].grad
list(self.sent.parameters())[0].grad

For the first two models .grad returns None, but for the last one it returns a tensor. I am not able to figure out what the actual problem is.

I am using view to change the dimension here:
lstate, lhidden = self.childsumtreelstm(ltree, linputs, h_seq_hed.view(1, 150), 1)
Is that the reason the parameters are not being updated?
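
As a rough check (a sketch using the names from the snippet above): .view() by itself is differentiable and does not detach the graph, so the question is mainly whether gradients ever reach these modules. One can verify the reshaped tensor still has a grad_fn and then list which parameters actually received gradients:

reshaped = h_seq_hed.view(1, 150)
print(reshaped.grad_fn)   # None here would mean the graph was already broken upstream

loss.backward()

# list which parameters actually received gradients
for name, p in self.doclstm.named_parameters():
    print(name, "grad is None" if p.grad is None else "has grad")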

Please help. Thank you in advance.

Isn't it the same thing?


You meant requires_grad=False, right?

I have the same problem: my weights are not updating, even though every weight layer has requires_grad=True.
Can you help me?

weight layer: torch.Size([32]) Parameter containing:
tensor([1.0001, 1.0001, 0.9999, 1.0001, 1.0001, 0.9999, 0.9999, 0.9999, 1.0001,
1.0001, 1.0001, 0.9999, 0.9999, 0.9999, 1.0001, 0.9999, 0.9999, 0.9999,
1.0001, 1.0001, 0.9999, 1.0001, 1.0001, 1.0001, 0.9999, 0.9999, 1.0001,
0.9999, 0.9999, 0.9999, 1.0001, 1.0001], requires_grad=True)
weight layer torch.Size([32]) Parameter containing:
tensor([ 6.5435e-05, 6.5572e-05, -6.5296e-05, 6.6796e-05, 6.5458e-05,
-6.5342e-05, -6.5205e-05, -6.5350e-05, 6.5402e-05, 6.5721e-05,
6.5457e-05, -6.5321e-05, -6.5283e-05, -6.5280e-05, 6.5410e-05,
-6.5351e-05, -6.5342e-05, -6.5063e-05, 6.5461e-05, 6.5426e-05,
-6.5271e-05, 6.5417e-05, 6.5417e-05, 6.5462e-05, -6.5357e-05,
-6.5296e-05, 6.5392e-05, -6.5317e-05, -6.5294e-05, -6.5305e-05,
6.5454e-05, 6.5511e-05], requires_grad=True)

Could you explain your use case a bit more and post a (minimal) executable code snippet to reproduce this issue, please?


I run a REDWGAN model on CPU.
The weight updates after 40 iterations were very small with the Adam optimizer; the weights of the generator changed only slightly. This is one of the layer's weights initially:

[[ 7.1380e-02,  1.3541e-02, -7.7020e-02],
 [-3.7415e-02, -3.7045e-02,  3.1205e-02],
 [ 8.8570e-03, -8.4177e-02,  5.3708e-03]]]]], requires_grad=True)

and after 40 iterations it was:

[[ 7.1350e-02,  1.3510e-02, -7.7051e-02],
 [-3.7445e-02, -3.7075e-02,  3.1175e-02],
 [ 8.8266e-03, -8.4208e-02,  5.3404e-03]]]]], requires_grad=True)

The two images, the input and the denoised output from REDWGAN, are the same.


Based on your description, the weights are indeed being updated, so it doesn't seem to be an issue of static (detached) weights.
If you are concerned about the gradient magnitude (i.e. the size of the weight updates themselves), you could try playing around with some hyperparameters, such as increasing the learning rate.
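
As a rough illustration of that (model, optimizer, and the learning rate value are placeholders), one could compare the relative size of one optimizer step to the weights themselves:

import torch

# compare the relative size of one optimizer step to the weights
before = {n: p.detach().clone() for n, p in model.named_parameters()}
optimizer.step()
for n, p in model.named_parameters():
    rel = (p.detach() - before[n]).norm() / (before[n].norm() + 1e-12)
    print(n, "relative update:", rel.item())

# if the relative updates are vanishingly small, a larger learning rate may help, e.g.
# optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)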


I’ve already changed the learning rate.

In my case this didn't help.