LSTM model does not learn

I am trying this on the IMDB dataset; here is my result as a Kaggle notebook.

I am trying to build a many-to-one LSTM model, using the bert-base-uncased tokenizer. The weird thing is that I can overfit on a single batch, but the model does not learn when I train on the entire dataset. Can you give me a hint about what is wrong here?

Here is the dataset:

import torch
import torch.nn as nn

tokenizer = torch.hub.load('huggingface/pytorch-transformers', 'tokenizer', 'bert-base-uncased')

class ImdbDataset(torch.utils.data.Dataset):
    def __init__(self, df, tokenizer):
        self.df = df
        self.tokenizer = tokenizer

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        review = self.df.iloc[idx]["review"]
        label = self.df.iloc[idx]["sentiment"]
        tokens = self.tokenizer(review, padding="max_length", add_special_tokens=False, truncation=True, max_length=256, return_tensors="pt")
        label = torch.tensor(1 if label == "positive" else 0)
        return tokens["input_ids"][0], label

Here is the model:

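class Model(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, output_dim, n_layers):
        super().__init__()  # must run before any submodule is assigned
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        # The nn.LSTM arguments were cut off in the post; this call is an assumed
        # reconstruction (batch_first=True matches the x[:, -1, :] indexing in forward).
        self.rnn = nn.LSTM(embedding_dim, hidden_dim, num_layers=n_layers, batch_first=True)
        self.fc = nn.Linear(hidden_dim, output_dim)
        self.dropout = nn.Dropout(0.3)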

    def forward(self, x):
        x = self.embedding(x)
        x, _ = self.rnn(x)
        x = x[:, -1, :]
        x = self.dropout(x)
        x = self.fc(x)
        x = x.squeeze(-1)  # (batch_size, 1) -> (batch_size,); stays 1-D even when batch_size is 1
        x = torch.sigmoid(x)
        return x

model = Model(vocab_size=len(tokenizer), embedding_dim=400, hidden_dim=128, output_dim=1, n_layers=2)

And here is the training loop, if that helps:

epochs = 8
clip = 5  # maximum total gradient norm passed to clip_grad_norm_
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
criterion = nn.BCELoss()
history = {
    'train_loss': [],
    'train_acc': [],
    'val_loss': [],
    'val_acc': [],
}

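for epoch in range(epochs):
    model.train()  # enable dropout while training
    train_loss = 0
    train_acc = 0
    for i, (inputs, labels) in enumerate(train_dataloader):
        # The right-hand side was lost in the post; `device` is assumed to be defined earlier.
        inputs, labels = inputs.to(device), labels.to(device)
        optimizer.zero_grad()  # clear gradients left over from the previous step
        output = model(inputs)
        loss = criterion(output, labels.float())
        loss.backward()
        train_loss += loss.item()
        train_acc += torch.sum(torch.round(output) == labels).item()
        nn.utils.clip_grad_norm_(model.parameters(), clip)  # clip after backward, before the update
        optimizer.step()

        if i % 100 == 0:
            print(f"Train Epoch: {epoch}, Iteration: {i}, Loss: {loss.item()}, ")

    train_loss /= len(train_dataloader)
    train_acc /= len(train_dataloader.dataset)

    val_loss = 0
    val_acc = 0

    model.eval()  # disable dropout for validation
    with torch.no_grad():  # validation needs no gradients
        for i, (inputs, labels) in enumerate(test_dataloader):
            inputs, labels = inputs.to(device), labels.to(device)
            output = model(inputs)
            loss = criterion(output, labels.float())
            val_loss += loss.item()
            val_acc += torch.sum(torch.round(output) == labels).item()
            if i % 100 == 0:
                print(f"Valid Epoch: {epoch}, Iteration: {i}, Loss: {loss.item()}, ")

    val_loss /= len(test_dataloader)
    val_acc /= len(test_dataloader.dataset)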


Thanks for reading.

I can’t spot any obvious problem with your module.

Since you say you can overfit on a single batch, did you try increasing the dataset size step by step (e.g., 1%, 5%, 10%) to see what happens?
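A minimal sketch of that step-by-step check, assuming a standard map-style PyTorch dataset. The helper name make_subset_loader and the toy TensorDataset are placeholders for illustration only; in the real notebook you would pass the ImdbDataset instance instead.

```python
import torch
from torch.utils.data import DataLoader, Subset, TensorDataset


def make_subset_loader(dataset, fraction, batch_size=32, seed=0):
    """Return a DataLoader over a random `fraction` of `dataset` (hypothetical helper)."""
    n = max(1, round(len(dataset) * fraction))
    generator = torch.Generator().manual_seed(seed)
    indices = torch.randperm(len(dataset), generator=generator)[:n].tolist()
    return DataLoader(Subset(dataset, indices), batch_size=batch_size, shuffle=True)


# Stand-in data so the sketch runs on its own; swap in the real dataset.
toy_dataset = TensorDataset(
    torch.randint(0, 100, (1000, 16)),  # fake token ids
    torch.randint(0, 2, (1000,)),       # fake binary labels
)

subset_sizes = {}
for fraction in (0.01, 0.05, 0.10, 1.0):
    loader = make_subset_loader(toy_dataset, fraction)
    subset_sizes[fraction] = len(loader.dataset)
    # Train a freshly initialised model on `loader` here and record the
    # training accuracy, to see at which size the model stops fitting.
```

Using a fixed seed keeps the smaller subsets nested inside the larger ones, so a failure that appears at 5% but not at 1% can be traced to the added examples rather than to a different random draw.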

One question I have: are you using padding in any way? When you grab the last output like this: x = x[:, -1, :], you might be grabbing the output at a padding position (if the sentence was padded) or the true last hidden state for the sentence (if the sentence was as long as the longest sentence in the batch). This question is older, so did you figure it out?
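To make the concern concrete, here is a small sketch that feeds the same short sentence through an LSTM with and without right padding. The layer sizes, the token ids and the padding id of 0 are arbitrary assumptions for the demonstration, not values taken from the notebook.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
PAD_ID = 0  # assumed padding id

embedding = nn.Embedding(10, 4)
lstm = nn.LSTM(4, 3, batch_first=True)

real = torch.tensor([[5, 7, 2]])                              # a 3-token sentence
padded = torch.tensor([[5, 7, 2, PAD_ID, PAD_ID, PAD_ID]])    # same sentence, right-padded to 6

with torch.no_grad():
    out_real, _ = lstm(embedding(real))      # shape (1, 3, 3)
    out_padded, _ = lstm(embedding(padded))  # shape (1, 6, 3)

true_last = out_real[:, -1, :]

# The output at the last real token is unchanged by the padding that follows it...
same_at_true_end = torch.allclose(out_padded[:, 2, :], true_last, atol=1e-6)
# ...but the output at position -1 has been updated by three extra padding steps.
same_at_padded_end = torch.allclose(out_padded[:, -1, :], true_last)

print(same_at_true_end, same_at_padded_end)
```

Without packing, the LSTM keeps updating its state on every padding token, so the value at position -1 is a state that has drifted away from the end of the sentence. With max_length=256 and short reviews, that can be a long run of padding steps.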


Hello @dreidizzle

I am using padding, but it happens inside the tokenizer:

tokens = self.tokenizer(review, padding="max_length", add_special_tokens=False, truncation=True, max_length=256, return_tensors="pt")

Also, I think x = x[:, -1, :] just captures the last hidden state from the LSTM. Do you think it should be x = x[:, :, -1]?
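A quick shape check of the two indexing options, as a sketch that assumes the LSTM output is laid out as (batch, sequence, hidden), i.e. batch_first=True. The sizes are taken from the post (max_length=256, hidden_dim=128); the batch size of 4 is arbitrary.

```python
import torch

batch_size, seq_len, hidden_dim = 4, 256, 128
x = torch.randn(batch_size, seq_len, hidden_dim)  # stand-in for the LSTM output

last_step = x[:, -1, :]     # all hidden units at the final time step
last_feature = x[:, :, -1]  # only the final hidden unit, at every time step

print(last_step.shape)      # torch.Size([4, 128])
print(last_feature.shape)   # torch.Size([4, 256])
```

So x[:, -1, :] is the slice whose size matches nn.Linear(hidden_dim, output_dim); x[:, :, -1] selects along the feature axis instead of the time axis.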

OK, I will try it.

So let’s say each of your batches contains sentences and the dimensions of your data are N x L x D, where N = batch size, L = max sentence length and D = dimension of the word vector. If you do x[:, -1, :], you are indeed pulling the last element of each sentence. But what I’m saying is this: suppose your batch has N = 4 and lengths [2, 3, 4, 5], so that L = 5 and the first 3 sentences had 3, 2, and 1 zeros added as padding. Then x[3, -1, :] is the real last hidden element for the sentence of length 5, but for the other sentences, something like x[0, -1, :] is not really the last hidden state you want, right? You’d want x[0, 1, :], x[1, 2, :] and x[2, 3, :], and you have x[3, 4, :] = x[3, -1, :]. I think this post addresses the issue, but I am not sure whether it is your issue, just that it could be: deep learning - Why do we "pack" the sequences in PyTorch? - Stack Overflow
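The example with lengths [2, 3, 4, 5] can be sketched as below. This is only an illustration under assumptions: the padding id is taken to be 0, the token ids and layer sizes are made up, and sentences are right-padded. It shows two ways to get the true last hidden state: indexing with length - 1, and packing the batch as in the linked post.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence

torch.manual_seed(0)
PAD_ID = 0  # assumed padding id

# N = 4 sentences with lengths [2, 3, 4, 5], right-padded to L = 5.
input_ids = torch.tensor([
    [11, 12,  0,  0,  0],
    [21, 22, 23,  0,  0],
    [31, 32, 33, 34,  0],
    [41, 42, 43, 44, 45],
])
lengths = (input_ids != PAD_ID).sum(dim=1)  # tensor([2, 3, 4, 5])

embedding = nn.Embedding(50, 8, padding_idx=PAD_ID)
lstm = nn.LSTM(8, 6, batch_first=True)

with torch.no_grad():
    embedded = embedding(input_ids)

    # Option 1: run over the padded batch, then pick index length - 1 in each row.
    output, _ = lstm(embedded)                      # (N, L, hidden)
    last_index = lengths - 1                        # tensor([1, 2, 3, 4])
    rows = torch.arange(output.size(0))
    last_by_index = output[rows, last_index]        # (N, hidden)

    # Option 2: pack the batch so the LSTM stops at each sentence's true end.
    packed = pack_padded_sequence(
        embedded, lengths.cpu(), batch_first=True, enforce_sorted=False
    )
    _, (h_n, _) = lstm(packed)
    last_by_packing = h_n[-1]                       # top layer's final state, (N, hidden)

print(last_index.tolist())
print(torch.allclose(last_by_index, last_by_packing, atol=1e-6))
```

Both options should give the same vectors for a unidirectional LSTM. Note that pack_padded_sequence rejects zero-length sequences, so an input that tokenizes to nothing but padding would need to be filtered out or handled separately.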