Loss remains constant in training the network

ayush1999 · March 30, 2018, 4:36pm

Here is my network:

import torch
import torch.nn as nn
from torch.autograd import Variable

class RNN(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super().__init__()
        self.hidden_size = hidden_size
        self.i2h = nn.Linear(input_size , hidden_size)
        self.h2o = nn.Linear(hidden_size, output_size)
        self.h2h = nn.Linear(hidden_size, hidden_size)
        self.Relu = nn.ReLU()
        self.softmax = nn.LogSoftmax(dim = 1)

    def forward(self, input, hidden):
        h = self.Relu(self.h2h(hidden)+  self.i2h(input))
        o = self.softmax(self.h2o(h))
        return o, h

    def init_hidden(self):
        return Variable(torch.zeros(1, self.hidden_size), requires_grad=True)

rnn = RNN(n_chars, 90, n_chars)
criterion = nn.L1Loss()
learning_rate = 0.05
optimizer = torch.optim.Adam(rnn.parameters(), lr = learning_rate)
hidden = rnn.init_hidden()
epochs = 5

rnn.cuda()
for epoch in range(epochs):
    for i in range(len(X)):
        for ele in X[i]:
            output, hidden = rnn(Variable(ele.t()).cuda(), hidden.cuda())
        loss = criterion(output, Variable(Y[i]).cuda())
        
        loss.backward(retain_graph=True)
        optimizer.zero_grad()
        optimizer.step()
        if (i%10 == 0):
            print(loss)

The loss I get is approximately constant at 4.513. Why is the loss not changing?

ptrblck · March 30, 2018, 4:45pm

You are deleting the gradients after they are computed and before the weight update is performed, so optimizer.step() has nothing to apply.
Try moving optimizer.zero_grad() to the start of the iteration, e.g.:

for epoch in range(epochs):
    for i in range(len(X)):
        optimizer.zero_grad()
        for ele in X[i]:
            output, hidden = rnn(Variable(ele.t()).cuda(), hidden.cuda())
        loss = criterion(output, Variable(Y[i]).cuda())
        
        loss.backward(retain_graph=True)
        optimizer.step()

ayush1999 · March 30, 2018, 4:47pm

What exactly does optimizer.zero_grad() do?

ptrblck · March 30, 2018, 4:49pm

It sets all gradients to zero, i.e. it basically deletes all gradients from the Parameters that were passed to the optimizer.
You need it because the gradients won’t be cleared otherwise, and thus they will be accumulated in each iteration.
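A minimal sketch of this behaviour, using a toy parameter rather than the RNN from the question: the parameter w, the loss sum(3 * w), and the SGD optimizer are assumptions chosen only to keep the numbers easy to follow. Note that in recent PyTorch releases zero_grad() resets the gradients to None by default instead of filling them with zeros; either way, a step() that follows it directly leaves the weights untouched.

```python
import torch

# Toy parameter (hypothetical, not the RNN from the question).
w = torch.nn.Parameter(torch.tensor([1.0, 2.0]))
optimizer = torch.optim.SGD([w], lr=0.1)

# d/dw sum(3 * w) = 3 for every element.
(w * 3).sum().backward()
first = w.grad.clone()

# A second backward() without zero_grad() adds to the stored gradient.
(w * 3).sum().backward()
accumulated = w.grad.clone()

# Order from the question: gradients are cleared before step(),
# so the optimizer has nothing to apply.
optimizer.zero_grad()
optimizer.step()
unchanged = w.detach().clone()

# Suggested order: clear, compute, then update.
optimizer.zero_grad()
(w * 3).sum().backward()
optimizer.step()
updated = w.detach().clone()

print(first, accumulated, unchanged, updated)
```

Clearing the gradients at the start of each iteration keeps the backward() and step() calls adjacent, which makes this kind of ordering mistake harder to introduce.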

ayush1999 · March 30, 2018, 4:52pm

I shifted the optimizer.zero_grad() above, but the loss is still constant.

ayush1999 · March 30, 2018, 4:57pm

When I remove the optimizer completely, the loss remains exactly constant at 4.5315. I have this feeling that the weight update isn’t happening.

ptrblck · March 30, 2018, 5:05pm

It’s probably not the error, but you should call .cuda() on the Tensor before wrapping it in a Variable (for example, in this case Variable(Y[i].cuda()) rather than Variable(Y[i]).cuda()).
Could you check the gradients with rnn.i2h.weight.grad?
Also could you provide the shapes of X and Y?
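The gradient check could be extended to every parameter at once, roughly as sketched below. The helper grad_report and the small nn.Linear stand-in are hypothetical names used for illustration; with the model from the question you would pass rnn instead.

```python
import torch
import torch.nn as nn

def grad_report(module):
    """Map each parameter name to True if its gradient contains any NaN."""
    return {
        name: bool(torch.isnan(p.grad).any())
        for name, p in module.named_parameters()
        if p.grad is not None
    }

# Hypothetical stand-in for the real model.
layer = nn.Linear(4, 3)
x = torch.randn(2, 4)
loss = layer(x).abs().mean()
loss.backward()

report = grad_report(layer)
print(report)
```

Calling the helper right after loss.backward() on the first iteration shows whether the NaNs are present from the start or only appear later in training.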

ayush1999 · March 30, 2018, 5:12pm

print(rnn.i2h.weight.grad) gives me a 90x90 matrix consisting of all nan values. Also, each element of X, i.e. X[i], is a vector of length 90. Y is also a vector of length 90.

ayush1999 · March 30, 2018, 5:34pm

Why are they nan values?

ptrblck · March 30, 2018, 8:55pm

Do you see these values from the beginning of your training?
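One way to narrow this down is autograd's anomaly detection, which raises an error at the backward function that first produces a NaN. The snippet below is only a sketch on a toy expression, sqrt(x) * 0 at x = 0, chosen because its gradient is 0/0; it is not a diagnosis of the network in this thread.

```python
import torch

x = torch.zeros(1, requires_grad=True)
caught = False
message = ""

with torch.autograd.detect_anomaly():
    # Forward value is 0, but the backward pass computes 0 / (2 * sqrt(0)).
    y = (torch.sqrt(x) * 0).sum()
    try:
        y.backward()
    except RuntimeError as err:
        caught = True
        message = str(err)

print(caught, message)
```

Anomaly detection slows training noticeably, so it is usually enabled only while debugging.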

ayush1999 · March 31, 2018, 9:09am

Yeah they were nan from the very beginning. Why?

DongDong_Chen · August 26, 2019, 8:26am

Have you found the reason?