Loss does not change and weights remain zero

I read a couple of threads here, but I could not resolve the issue in my code. I am a newbie in deep learning and PyTorch. I was wondering why the loss does not change, and at the end of the code I found that the weights are all zero.

I would appreciate it if you could help me with this.

def generate_minibatch(X, y):  # X and y are numpy matrices
    X, y = shuffle(X, y)
    for i in range(0, X.shape[0], args.batch_size):  # batch_size is equal to 128
        X_mini = X[i:i + args.batch_size]
        y_mini = y[i:i + args.batch_size]

        y_mini = y_mini.reshape(-1, 1)

        X_mini = torch.FloatTensor(X_mini)
        y_mini = torch.FloatTensor(y_mini)

        y_mini = y_mini.view(-1, 1)

        yield X_mini, y_mini

class reviewClassifier(nn.Module):
    def __init__(self):
        super(reviewClassifier, self).__init__()
        self.fc1 = nn.Linear(in_features=args.fixed_dimension, out_features=64) # fixed dimension is equal to 128
        self.fc2 = nn.Linear(in_features=64, out_features=32)
        self.fc3 = nn.Linear(in_features=32, out_features=1)
    def forward(self, x):
        x = x.view(-1, args.fixed_dimension)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = F.relu(self.fc3(x))
        y_pred = torch.sigmoid(x)
        return y_pred

classifier = reviewClassifier()

loss_func = nn.BCELoss()

optimizer = optim.Adam(classifier.parameters(), lr=0.01)


for epoch in range(100):
    total_loss = 0 
    for X_mini, y_mini in generate_minibatch(X_train, y_train):
        
        classifier.zero_grad()
        
        y_pred = classifier.forward(x=X_mini.float())
        
        loss = loss_func(y_pred, y_mini)
        total_loss = torch.add(total_loss, loss.data)
        
        loss.backward()
        optimizer.step()
        
    if epoch % 10 == 0:
        print(total_loss)
        
for idx, param in enumerate(classifier.parameters()):
    print(' >> ', idx, param.grad)
print('Finished!')

The loss printed after every 10 epochs and the grads (I guess these are the trainable weights) are shown here:

tensor(51.2170)
tensor(43.6682)
tensor(43.6682)
tensor(43.6682)
tensor(43.6682)
tensor(43.6682)
tensor(43.6682)
tensor(43.6682)
tensor(43.6682)
tensor(43.6682)
 >>  0 tensor([[0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        ...,
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.]])
 >>  1 tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])
 >>  2 tensor([[0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        ...,
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.]])
 >>  3 tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0.])
 >>  4 tensor([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0.]])
 >>  5 tensor([0.])
Finished!

The gradients of all parameters are zero, not the parameters themselves.
Most likely your model collapsed because you are using two non-linearities on the output (F.relu and torch.sigmoid).
After removing the relu and just using sigmoid on the last layer’s output, I could successfully fit your model on some random data.
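For reference, my forward looked roughly like this, with only the final relu dropped (the relus on the hidden layers can stay):

def forward(self, x):
    x = x.view(-1, args.fixed_dimension)
    x = F.relu(self.fc1(x))
    x = F.relu(self.fc2(x))
    x = self.fc3(x)              # no relu on the last layer
    y_pred = torch.sigmoid(x)    # sigmoid is now the only non-linearity on the output
    return y_pred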

Thank you for your prompt response. It is interesting to know that we should not stack two non-linear activation functions like relu and sigmoid on the output.

I changed the forward function to:

class reviewClassifier(nn.Module):
    def __init__(self):
        super(reviewClassifier, self).__init__()
        self.fc1 = nn.Linear(in_features=args.fixed_dimension, out_features=64) # fixed dimension is equal to 128
        self.fc2 = nn.Linear(in_features=64, out_features=32)
        self.fc3 = nn.Linear(in_features=32, out_features=1)
    def forward(self, x):
        x = x.view(-1, args.fixed_dimension)
        y = self.fc1(x)
        y = self.fc2(y)
        y = self.fc3(y)
        y = torch.sigmoid(y)
        return y

However, in the first epoch, after processing a couple of batches, the gradients of every layer became zero and I got the same results as before.

Well, it could work, but note that relu will clamp all values to [0, +inf), which might yield constant values if you are unlucky with your initialization.
Could you post some data (e.g. 10 samples) so that I could have a look?
I just used some random values to debug your code and it worked without the last relu.
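To see why the extra relu is problematic: relu clamps the last layer’s output to be non-negative, and sigmoid maps any non-negative value to at least 0.5, so the model can never predict the negative class; and whenever the pre-activation is negative, relu’s gradient is zero, which would explain the all-zero grads you printed. A quick check (assuming the usual torch and F imports):

x = torch.linspace(-10, 10, 5)   # [-10, -5, 0, 5, 10]
print(torch.sigmoid(F.relu(x)))  # every value is >= 0.5; exactly 0.5 for all x <= 0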

Thank you for your response. I see your point about the combination of relu and sigmoid functions.

Some data I used for training (I reduced the number of features in my training data from 128 to 20):
X_train [20:30]:

array([[755, 234, 386, 240, 756, 546, 667,  29, 757, 171, 598, 758, 746,
        667, 331, 104, 713, 357, 685, 759],
       [  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0, 760,
         64, 218,  71, 423,  25, 104, 761],
       [  0,   0,   0,   0,   0,   0, 762, 254,  82,  72,  54,  16, 763,
        764, 765, 766, 277,  72,  93, 767],
       [126, 780,  91, 781, 349, 782, 783, 784, 349, 785, 786, 240,  70,
        787, 788, 789, 790,  68, 116, 791],
       [794, 795, 491, 624, 796, 328, 491, 225, 391, 360, 491, 225, 433,
        192, 797, 403, 798, 799, 190, 800],
       [815, 510, 294, 819,  86,   5, 820, 821, 379, 822, 823,  63,  64,
         65,  93,  36, 169, 437, 806, 824],
       [849,  64, 397, 850, 482, 297, 851, 852,  64, 104, 853, 765,  25,
        854,  64,  86, 298, 778, 855, 856],
       [869,  64, 870, 251, 254,  29,  20, 611, 192, 871,  30, 217, 872,
        211,  30,  25, 278, 873, 874, 875],
       [  8, 319, 892, 893, 693,  30, 523, 894, 895, 896, 897, 545,  25,
        898, 523, 899,  71, 900, 326, 901],
       [903,  30, 123, 904, 479, 102, 566, 905, 906, 907, 908, 111, 535,
        826,  46, 909, 910, 911, 912, 913]], dtype=int16)

y_train [20:30]:

array([0, 1, 1, 0, 1, 1, 1, 1, 1, 1])
np.mean(X_train) = 1957.034
np.var(X_train) = 13721958.513
np.max(X_train) = 23637
np.min(X_train) = 0

np.mean(y_train) = 0.683
np.var(y_train) = 0.216
np.max(y_train) = 1
np.min(y_train) = 0

Thanks for the data sample.
It looks like you are dealing with some kind of indices, is that correct?
If that’s the case, I would recommend using an nn.Embedding layer to map these indices to a dense floating-point representation.
Here is a small code modification to your model:


class reviewClassifier(nn.Module):
    def __init__(self):
        super(reviewClassifier, self).__init__()
        self.emb = nn.Embedding(914, 6)  # num_embeddings=914 covers indices 0..913 from your sample, embedding_dim=6
        self.fc1 = nn.Linear(in_features=20*6, out_features=64)  # 20 indices per sample, each embedded to 6 values
        self.fc2 = nn.Linear(in_features=64, out_features=32)
        self.fc3 = nn.Linear(in_features=32, out_features=1)

    def forward(self, x):
        x = self.emb(x)            # [batch_size, 20] -> [batch_size, 20, 6]
        x = x.view(x.size(0), -1)  # flatten to [batch_size, 120]
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)            # no relu before the sigmoid output
        y_pred = torch.sigmoid(x)
        return y_pred

Also you would have to pass your data as torch.LongTensors now.
If the values should represent some other features, you might want to normalize the data before feeding it to the model (e.g. such that mean=0, stddev=1).
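The only other change would be in your minibatch generator, e.g. something like this (the inputs become LongTensors so nn.Embedding can index into them, while the targets stay FloatTensors for nn.BCELoss; you would also drop the .float() call on X_mini in the training loop):

def generate_minibatch(X, y):
    X, y = shuffle(X, y)
    for i in range(0, X.shape[0], args.batch_size):
        X_mini = torch.LongTensor(X[i:i + args.batch_size])               # word indices for nn.Embedding
        y_mini = torch.FloatTensor(y[i:i + args.batch_size]).view(-1, 1)  # float targets for nn.BCELoss
        yield X_mini, y_mini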


Thank you so much. Yes, these are indices of unique words in a corpus. Embedding works very well.
These are the losses:

tensor(39.4518)
tensor(0.7318)
tensor(0.0179)
tensor(0.0121)
tensor(0.0125)
tensor(0.0324)
tensor(0.0114)
tensor(0.0113)
tensor(0.0112)
tensor(0.0111)

I think that if we don’t use an embedding, the solution is logically wrong, because if we just pass the indices to the neural network (as I was doing before), we are saying that the only thing that matters is a word’s index in the vocabulary, which is not sufficient. In my case, where I want to classify Yelp reviews into two classes (positive and negative), the position of a word does not really matter. If I am not wrong, the embedding (after training) is meant to represent each word semantically, which is good for my task.
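For example, with the same sizes as in your model, each index simply looks up its own learnable 6-dimensional vector, so the magnitude of the index itself no longer carries any meaning:

emb = nn.Embedding(914, 6)
idx = torch.LongTensor([5, 900])
print(emb(idx))  # two 6-dim vectors; their values are unrelated to the sizes of the indices 5 and 900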

Thank you once again for your help.