NLP multi-class text classification with biLSTM network - model does not learn


I have a project on NLP multi-class classification (4 classes) with the biLSTM network. I use standard cross-entropy loss as a loss function and Adam optimizer. Unfortunately, the model does not learn and I would appreciate it if someone could suggest a model improvement.

The model looks like this:

import torch
import torch.nn as nn
import torch.nn.functional as F

class LSTM(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, output_dim, n_layers, 
                 bidirectional, dropout, pad_idx):
        super(LSTM, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx = pad_idx)
        self.lstm = nn.LSTM(input_size = embedding_dim,
                            hidden_size = hidden_dim,
                            num_layers = n_layers,
                            bidirectional = bidirectional,  # fc1 expects hidden_dim * 2 inputs
                            batch_first = True)
        self.fc1 = nn.Linear(hidden_dim * 2, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, 4)
        self.relu = nn.ReLU() 
        self.dropout = nn.Dropout(dropout)
    def forward(self, text, text_lengths):
        embedded = self.embedding(text)

        packed_embedded = nn.utils.rnn.pack_padded_sequence(embedded, text_lengths, batch_first=True)
        packed_output, (hidden, cell) = self.lstm(packed_embedded)   
        cat = torch.cat((hidden[-2, :, :], hidden[-1, :, :]), dim=1)
        rel = self.relu(cat)
        dense1 = self.fc1(rel)
        drop = self.dropout(dense1)
        output = self.fc2(drop)

        return output

Thank you

I’m not sure, but this might(!) cause issues

The shape of hidden is (num_layers * num_directions, batch_size, hidden_size), which means you have to be careful with indexing when using a Bi-LSTM/GRU with multiple layers. I would suggest first separating num_layers and num_directions.
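To make the indexing concrete, here is a minimal, self-contained sketch of that separation. All sizes are made-up values chosen for illustration, not taken from the question. It reshapes hidden into (num_layers, num_directions, batch_size, hidden_size), keeps the last layer, and concatenates both directions. For the last layer this gives the same tensors as hidden[-2] and hidden[-1].

```python
import torch
import torch.nn as nn

# Illustrative sizes only (assumptions, not taken from the question).
num_layers, num_directions = 2, 2
batch_size, seq_len, embedding_dim, hidden_dim = 3, 5, 8, 16

lstm = nn.LSTM(input_size=embedding_dim, hidden_size=hidden_dim,
               num_layers=num_layers, bidirectional=True, batch_first=True)

x = torch.randn(batch_size, seq_len, embedding_dim)
output, (hidden, cell) = lstm(x)

# hidden has shape (num_layers * num_directions, batch_size, hidden_dim).
# Separate the layer and direction axes, then keep only the last layer.
last_layer = hidden.view(num_layers, num_directions, batch_size, hidden_dim)[-1]
h_forward, h_backward = last_layer[0], last_layer[1]

# Concatenate both directions: shape (batch_size, 2 * hidden_dim).
final_hidden_state = torch.cat((h_forward, h_backward), dim=1)
print(final_hidden_state.shape)
```

If the two directions are summed instead of concatenated, the result has shape (batch_size, hidden_dim), so the first linear layer needs a correspondingly smaller input size.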

Here’s a snippet of my own code. It’s a bit verbose since I support LSTM and GRU as well as unidirectional and bidirectional.

        # Push through RNN layer
        rnn_output, self.hidden = self.rnn(X, self.hidden)

        # Extract last hidden state
        if self.params.rnn_type == RnnType.GRU:
            final_state = self.hidden.view(self.params.num_layers, self.num_directions, batch_size, self.params.rnn_hidden_dim)[-1]
        elif self.params.rnn_type == RnnType.LSTM:
            final_state = self.hidden[0].view(self.params.num_layers, self.num_directions, batch_size, self.params.rnn_hidden_dim)[-1]
        # Handle directions
        final_hidden_state = None
        if self.num_directions == 1:
            final_hidden_state = final_state.squeeze(0)
        elif self.num_directions == 2:
            h_1, h_2 = final_state[0], final_state[1]
            # final_hidden_state = h_1 + h_2               # Add both states (requires changes to the input size of first linear layer + attention layer)
            final_hidden_state = torch.cat((h_1, h_2), 1)  # Concatenate both states

The full code is here.

Hi Chris @vdw,
Thank you for your suggestion. Since I didn’t get the whole idea, could you be so kind as to direct me to some implementation of your Bi-LSTM/GRU model on the multiclass text classification task? Maybe I overlooked it, but I could not see it on the GitHub link.

Thank you

Well, the code for the model is all in this file I already linked to in the previous post. An example usage is then as follows:

import torch
from pytorch.models.text.classifier.rnn import RnnClassifier, RnnType, AttentionModel, Parameters

# Check if GPU available
use_cuda = torch.cuda.is_available()
device = torch.device("cuda:0" if use_cuda else "cpu")

# Configure the RNN model
params = { 'rnn_type': RnnType.LSTM,
           'rnn_hidden_dim': 512,
           'num_layers': 2,
           'bidirectional': True,
           'dropout': 0.2,
           'vocab_size': max_idx+1,
           'embed_dim': 300,
           'linear_dims': [200, 100],
           'label_size': len(label_set),
           'clip': 0.5,
           'attention_model': AttentionModel.DOT }

params = Parameters(params)

model = RnnClassifier(device, params)

And then I use the model in my training loop as usual. Just some comments:

  • the Parameters class is just for convenience; it converts the dictionary of parameters into a class with all parameters as class variables.
  • max_idx here is the largest index in my word list, making max_idx+1 the size of the vocabulary
  • the configuration example above creates a 2-layer Bi-LSTM with a 512-dim hidden representation; the word embedding size is 300. The output of the last Bi-LSTM layer (and the last hidden state of the sequence) is pushed through 2 linear layers of size 200 and 100 (the linear_dims setting), before finally being pushed through a linear layer of the output size.
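As a rough illustration of the comments above, the sketch below uses a hypothetical stand-in for the Parameters class (not the implementation from the linked repository; it sets instance attributes rather than class variables) and derives the sizes of the linear layers from the configuration. It assumes the two directions are concatenated, ignores any effect of the attention layer, and takes label_size to be 4, matching the four classes in the original question.

```python
class Parameters:
    """Hypothetical stand-in: exposes dictionary entries as attributes."""

    def __init__(self, data):
        for key, value in data.items():
            setattr(self, key, value)


params = Parameters({
    'rnn_hidden_dim': 512,
    'bidirectional': True,
    'linear_dims': [200, 100],
    'label_size': 4,  # assumption: the 4 classes from the original question
})

num_directions = 2 if params.bidirectional else 1

# Concatenating both directions doubles the input size of the first linear layer.
sizes = [params.rnn_hidden_dim * num_directions] + params.linear_dims + [params.label_size]

# (in_features, out_features) for each linear layer, in order.
layer_shapes = list(zip(sizes[:-1], sizes[1:]))
print(layer_shapes)  # [(1024, 200), (200, 100), (100, 4)]
```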

I hope that helps.