Train loss doesn't decrease

Hello,

I am solving a classification problem: I have a list of keywords and need to detect whether they occur in a text. I used the torchtext classification approach with EmbeddingBag and Embedding, and accuracy was about 90%, but I am hoping for 95%+.
Now I am using Embedding + LSTM and have a serious problem with the loss: it does not change. I tried learning rates all the way from 1000 down to 0.0001, but the loss only moves from 0.7 to 0.69.
Help me please :cry:

class TextClassificationModel_vec(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_size):
        super(TextClassificationModel_vec, self).__init__()
        # sparse=True makes the embedding produce sparse gradients
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0, sparse=True)
        self.lstm = nn.LSTM(input_size=embed_dim, hidden_size=hidden_size, num_layers=1, batch_first=True)
        self.fc = nn.Linear(hidden_size, 1)
        self.dropout = nn.Dropout(p=0.6)

    def forward(self, x):
        x = self.embedding(x)        # (batch, seq_len, embed_dim)
        x = self.dropout(x)
        x, _ = self.lstm(x)          # LSTM outputs for every time step
        x = self.fc(x[:, -1, :])     # classify from the output at the last time step
        return x

vocab = Vocab(c, min_freq=1)
vocab_size = len(vocab) # 11038 
emsize = 64
hidden_size = 128
model = TextClassificationModel_vec(vocab_size, emsize, hidden_size).to(device)

EPOCHS = 15000
BATCH_SIZE = 50

criterion = torch.nn.BCEWithLogitsLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.001)

Have you tried any other optimizer? Like Adam or Adagrad?
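For example, something like this (just a sketch; note that because the embedding uses sparse=True, the optimizer has to handle sparse gradients, which Adagrad and plain SGD do but Adam does not):

# Adagrad handles the sparse gradients produced by nn.Embedding(..., sparse=True)
optimizer = torch.optim.Adagrad(model.parameters(), lr=0.01)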

Yes, I used Adam at first, but then I switched to SGD.

Have you tried printing out the gradient values for model.parameters() after calling loss.backward()? Maybe your initial gradient is near 0, so the model won’t update at all?
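Something along these lines, right after loss.backward() and before optimizer.step() (just a sketch, adapt it to your loop):

for name, p in model.named_parameters():
    if p.grad is None:
        continue
    g = p.grad
    if g.is_sparse:   # the embedding uses sparse=True, so its grad comes back as a sparse tensor
        g = g.to_dense()
    print(name, g.abs().mean().item(), g.abs().max().item())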


I have just checked. Yes, indeed: the gradients are around 10e-5 at first and then drop to 10e-45. How can I fix it?

Ok, so that’s the issue. It has nothing to do with the optimizer. Do you use any in-place operations? That would kill the gradients.
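(By "in-place" I mean operations that modify a tensor directly, e.g. the underscore methods like add_() or relu_(), or += applied to a tensor that autograd still needs. A tiny standalone illustration of how they can break the backward pass:)

import torch

a = torch.randn(4, requires_grad=True)
b = torch.sigmoid(a)    # sigmoid saves its output for the backward pass
b.add_(1.0)             # in-place edit of that saved output (note the trailing underscore)
b.sum().backward()      # raises a RuntimeError about an in-place modification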

In the train function:

total_loss += loss.item()
total_acc += (torch.round(torch.sigmoid(predicted)) == label).sum().item()

In the collate function:

def collate_batch(batch):
    label_list, text_list = [], []
    for _label, _text in batch:
        label_list.append(torch.FloatTensor([_label]))
        text = torch.tensor(_text, dtype=torch.int64)
        text_list.append(text)

    label_list = torch.stack(label_list)
    text_list = torch.stack(text_list)
    
    return label_list.to(device), text_list.to(device)
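(Side note: torch.stack only works here if every text in the batch already has the same length. If the lengths can differ, a padded collate would look roughly like this, using padding_value=0 to match padding_idx=0 in the embedding:)

import torch
from torch.nn.utils.rnn import pad_sequence

def collate_batch(batch):
    label_list, text_list = [], []
    for _label, _text in batch:
        label_list.append(torch.FloatTensor([_label]))
        text_list.append(torch.tensor(_text, dtype=torch.int64))

    labels = torch.stack(label_list)
    # pad every text to the length of the longest one in the batch
    texts = pad_sequence(text_list, batch_first=True, padding_value=0)
    return labels.to(device), texts.to(device)

One thing to double-check with padding: x[:, -1, :] in forward() would then read the LSTM output at a padded position for the shorter texts.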

I am sorry, I am a newbie to programming. :cold_sweat:

No need to apologize! :slight_smile:

When you calculate your gradients, what variable are you back-propagating? That is, are you calculating your gradients by calling total_loss.backward()?

I think it will be more convenient this way :blush:

def train(dataloader):
    model.train()
    
    total_acc, total_count = 0, 0
    total_loss = 0
    for idx, (label, text) in enumerate(dataloader):
        optimizer.zero_grad()
        predicted = model(text)
        loss = criterion(predicted, label)
        loss.backward()  # gradients come from the per-batch loss (total_loss below is just a Python float)
        optimizer.step()
        total_loss += loss.item()
        total_acc += (torch.round(torch.sigmoid(predicted)) == label).sum().item()
        total_count += label.size(0)
    return total_acc, total_loss, total_count 

I was reading through this tutorial on Text Classification and I can’t really see what’s wrong with your model.

One thing that could be a problem is the cost function BCEWithLogitsLoss: looking at the docs, it accepts a pos_weight argument, which defaults to None (i.e. no re-weighting of the positive class). Perhaps this could be the issue? If your classes are heavily imbalanced, the unweighted loss might leave you with a near-0 gradient?
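For reference, pos_weight would be passed like this (the 3.0 is just a made-up negatives-to-positives ratio, to show the call):

pos_weight = torch.tensor([3.0], device=device)   # weight applied to the positive class
criterion = torch.nn.BCEWithLogitsLoss(pos_weight=pos_weight)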

Did not help :pensive:

Ah, no! :cold_sweat:

Then I don’t know what else to recommend. Perhaps there’s something of use in the tutorial I shared above? Otherwise, someone else with more experience might be able to help! :slight_smile: Sorry!


This is my first experience with LSTMs. Unfortunately, it has not been successful yet. :sweat_smile:
Thank you for your help!


I have changed the LSTM to a GRU and it works. But the accuracy is lower than when I used only the Embedding:
Embedding + Linear: acc = 0.95
Embedding + GRU + Linear: acc = 0.9
:expressionless:
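(For anyone reading later: the change was essentially just swapping nn.LSTM for nn.GRU. Roughly like this, since the exact code was not posted and the class name here is assumed:)

class TextClassificationModel_gru(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_size):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0, sparse=True)
        self.gru = nn.GRU(input_size=embed_dim, hidden_size=hidden_size,
                          num_layers=1, batch_first=True)
        self.fc = nn.Linear(hidden_size, 1)
        self.dropout = nn.Dropout(p=0.6)

    def forward(self, x):
        x = self.embedding(x)
        x = self.dropout(x)
        x, _ = self.gru(x)        # GRU returns (output, h_n); no cell state like the LSTM
        x = self.fc(x[:, -1, :])
        return x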


That’s pretty strange behaviour; perhaps mention this to a developer? They might have a better idea as to why it works with a GRU rather than an LSTM! :slight_smile: But glad to hear it’s working!
