LSTM Model not training

Hi, I have a character-level LSTM model for word classification and it doesn't seem to be training. The error rate after each epoch is almost the same (sometimes even higher than the previous one!). The input tensor size is 16 x 250 x 63 (batch x seq length x alphabet size).

One-hot encoding is used to turn each string into a 2D matrix of size 250 x 63. Left padding is done with 0s.
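Roughly, the encoding looks like this (just a sketch; the alphabet below is a placeholder, my real one has 63 symbols, and char_to_idx is the obvious character-to-index map):

import torch

ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789 "  # placeholder; my real alphabet has 63 symbols
MAX_LEN = 250
char_to_idx = {c: i for i, c in enumerate(ALPHABET)}

def encode(word):
    # returns a MAX_LEN x len(ALPHABET) one-hot matrix, left-padded with zero rows
    mat = torch.zeros(MAX_LEN, len(ALPHABET))
    word = word[-MAX_LEN:]              # truncate if longer than MAX_LEN
    offset = MAX_LEN - len(word)        # left padding: the word sits at the end
    for i, c in enumerate(word):
        mat[offset + i, char_to_idx[c]] = 1.0
    return mat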

CrossEntropyLoss was used as the loss function

The LSTM class is defined as follows:

class CharLSTM(nn.Module):

    def initHidden(self):
        h0 = Variable(torch.randn(self.nlayers, self.batch_size, self.hidden_dim).cuda())  # Initial hidden state
        c0 = Variable(torch.randn(self.nlayers, self.batch_size, self.hidden_dim).cuda())  # Initial cell state
        return (h0, c0)

    def __init__(self, input_size, hidden_size, nlayers, batch_size):
        super(CharLSTM, self).__init__()
        self.hidden_dim = hidden_size
        self.batch_size = batch_size
        self.nlayers = nlayers
        self.hidden = 0
        self.lstm = nn.LSTM(input_size, hidden_size)
        self.dense1 = nn.Linear(hidden_size, 2)

    def forward(self, inputs):
        x = inputs.transpose(0, 1)
        self.hidden = self.initHidden()
        x, self.hidden = self.lstm(x, self.hidden)
        x = (self.hidden[0] + self.hidden[1])/2
        x = x.squeeze(0)
        x = self.dense1(x)

        return F.softmax(x)

The training of the model is as shown below:

optimizer  = optim.Adam(filter(lambda p: p.requires_grad, model.parameters()), lr=0.001)
start = time.time()

losses = []
test_acc = []
train_loss = []

for epoch in range(5):
    print("Epoch %d " % epoch)
    total_loss = 0
    count = 0
    print(model.parameters())
    for _, (X, Y) in enumerate(train_loader):
        optimizer.zero_grad()
        Y = Y.squeeze(1)
        probs = model(Variable(X.cuda()))
        loss = loss_function(probs, Variable(Y.cuda()))
        loss.backward()
        optimizer.step()
        total_loss += loss.data[0]
        count += 1

losses.append(total_loss/count)

accuracy = 0

Please help me out here


Please respond. Some sort of direction would be helpful…

Not a solution, but I have the same not-training problem with autoencoders… Weirdly enough, I think it's the official example :frowning: Still waiting for a reply…

Hi again,

I did some checking on my side and these are my observations:

  1. Made sure that all the tensors are wrapped inside Variables, so that the backprop function has the history it needs.
  2. Checked param.grad for each of the model parameters (see the snippet after this list). These turned out to be really, really small numbers, on the order of 10^-18.
  3. Checked the values of model.parameters() before and after calling loss.backward() and optimizer.step(). No change, implying that backpropagation is not happening, or that it is happening but the gradients are so small that there is no change in the weights (learning rate = 0.001, btw).
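This is roughly how I checked the gradients (a quick sketch; model is the CharLSTM instance from above):

for name, param in model.named_parameters():
    if param.grad is not None:
        print(name, param.grad.abs().mean())   # these all come out around 1e-18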

Could this be a case of vanishing gradients? LSTMs aren't supposed to have that problem, right?

Please suggest solutions.


I am having a similar issue as well. Did you end up figuring out what the issue was?

I think you should initialize the cell state and hidden state to a tensor of zeros, instead of randn.

These are not like weights; if you initialize them randomly, the forget gate in the LSTM cell will try to remember stuff that isn't relevant, since this is the first timestep in your sequence.

The 'forget gate' is a sigmoid layer that decides what is relevant at each timestep, outputting a value between 0 and 1, with 0 meaning 'completely forget this' and 1 meaning 'this is really important, so let's remember it'.

So if you use randn for the hidden state initialization, these random numbers will be fed into the forget gate and the cell state, adding noise to the LSTM cell; maybe this is what is preventing convergence in your model. But when you initialize them to 0, the forget gate just 'forgets' them and the cell does not try to remember random stuff.
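Something like this, just swapping randn for zeros in the initHidden above (a sketch in the same style as the original code):

def initHidden(self):
    h0 = Variable(torch.zeros(self.nlayers, self.batch_size, self.hidden_dim).cuda())  # start with no memory
    c0 = Variable(torch.zeros(self.nlayers, self.batch_size, self.hidden_dim).cuda())
    return (h0, c0)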


Looks like I’m facing the same problem as well. Anyone figure out what the problem was?

I am encountering the same problem.

After some trial and error, I changed my optimizer from SGD back to traditional GD and it worked. However, I have yet to find out why. In both cases, the learning rate is fixed at 0.1.

I realize the OP posted a long time ago, but to keep people from getting confused by this:

These lines need to be removed

also these.

edit:
also remove this
self.hidden = self.initHidden()
from the forward pass; it's re-initializing your hidden state every time you make a forward pass.

and really just refer to the example language model for the rest! https://github.com/pytorch/examples/blob/master/word_language_model/model.py

In the example you've mentioned, the new hidden state is detached from previous hidden states using repackage_hidden(). Isn't this the same as resetting the hidden state to zero with each forward pass?

Has this problem been solved? I am encountering a similar issue with my seq2seq VAE, which I am almost sure is some code bug.

No, it's the same as hidden.detach(). It's just a way to keep loss.backward() from computing the gradients all the way back to the start of your training, if that makes sense. I believe the official term for this is truncated backpropagation through time.
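If it helps, repackage_hidden in that example is essentially just this (a sketch; it recursively detaches whatever tuple of hidden states you give it):

def repackage_hidden(h):
    # Detach the hidden state from the graph of the previous batch,
    # so backward() stops here instead of going all the way back (truncated BPTT)
    if isinstance(h, tuple):
        return tuple(repackage_hidden(v) for v in h)
    return h.detach()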

I'm still not quite sure I understand. The hidden state for each batch that goes through the forward pass is always zeros, correct? So what difference would it make to detach the initial hidden state, versus using a new hidden state for each new batch?

Also, in this example in the PyTorch docs, init_hidden is called for each input sequence. Aren't they resetting the hidden state to zero in this case (which, btw, also detaches the history)?

In any case, I've tried detaching the hidden state, and I'm still facing the same problem: parameters.grad is really small, on the order of 10^-8. Vanishing gradients in an LSTM???

The hidden state at the start of each batch is not always zeros.

If you want to preserve some signal between batches, because you are modeling one large sequence in chunks like a language model, you would keep the hidden state at the end of batch_1, feed it as the initial state to batch_2, and so on (see the sketch below). You have to detach it in between batches so that you do not compute gradients w.r.t. operations that happened in batch_1 while working on batch_2. In Keras you can do this with the stateful=True flag.
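Roughly like this, assuming a model whose forward takes and returns the hidden state, as in the word_language_model example (a sketch, not the CharLSTM above):

hidden = model.init_hidden(batch_size)
for X, Y in train_loader:
    hidden = repackage_hidden(hidden)   # detach so no gradients flow into earlier batches
    optimizer.zero_grad()
    output, hidden = model(X, hidden)   # final hidden state carries over to the next batch
    loss = loss_function(output, Y)
    loss.backward()
    optimizer.step()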

If you don't need any signal between batches, because you are modeling many separate sequences as you would in document classification, you do not have to worry about manually feeding the hidden state back at all, at least if you aren't using nn.RNNCell. You should use the lstm like this:

x, _ = self.lstm(x)

where the lstm will automatically initialize the first hidden state to zero and you don’t use the output hidden state at all.

again, the hidden state is not the output you should be using for classification.

x = (self.hidden[0] + self.hidden[1])/2

^-- is a large problem if you have not removed it already.

Also, the loss functions in PyTorch are meant to work with either logits or log_softmax (CrossEntropyLoss applies log_softmax itself, so the forward pass should not end with a softmax). I would really recommend spending some time with the word language model example again. It is quite good.
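Putting the advice in this thread together, the forward pass could look roughly like this (a sketch, untested on the OP's data; it lets the LSTM initialize its own hidden state, takes the output at the last timestep instead of averaging hidden and cell state, and returns raw logits for CrossEntropyLoss):

def forward(self, inputs):
    x = inputs.transpose(0, 1)    # seq_len x batch x alphabet_size
    x, _ = self.lstm(x)           # hidden state is auto-initialized to zeros
    x = x[-1]                     # output at the last timestep: batch x hidden_size
    return self.dense1(x)         # raw logits; CrossEntropyLoss applies log_softmax itself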


Hmm, that makes sense, but it doesn't help with my problem. I'm trying to do sentence (or rather clause) multiclass classification. I don't want any signal from the previous batch to the current batch. However, I'm still facing the problem of small gradients, which leads to terrible accuracy and F1 score (the model just predicts the most common class in the data).

Did you try normalizing/scaling your inputs? In my case, normalizing between 0 and 1 led to very small values in the vectors (spectrograms are weird things) and it wouldn't learn. However, learning happened after properly scaling the inputs.
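By "properly scaling" I mean something like standardizing the inputs, e.g. (a sketch; X here is a float feature tensor like a spectrogram, not a one-hot matrix):

X = (X - X.mean()) / (X.std() + 1e-8)   # zero mean, unit variance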