I have a project on NLP multi-class classification (4 classes) with a biLSTM network. I use standard cross-entropy loss and the Adam optimizer. Unfortunately, the model does not learn, and I would appreciate it if someone could suggest an improvement to the model.
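For reference, here is a stripped-down sketch of the kind of setup I mean; the names and sizes below are only illustrative, not my actual code:

import torch
import torch.nn as nn

class BiLSTMClassifier(nn.Module):
    # Embedding -> single-layer biLSTM -> linear layer over the final hidden states
    def __init__(self, vocab_size, embed_dim=300, hidden_dim=128, num_classes=4):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hidden_dim, num_classes)   # 2x because of bidirectionality

    def forward(self, x):
        embedded = self.embedding(x)             # (batch, seq_len, embed_dim)
        _, (h_n, _) = self.lstm(embedded)        # h_n: (2, batch, hidden_dim)
        h = torch.cat((h_n[0], h_n[1]), dim=1)   # concatenate forward/backward final states
        return self.fc(h)                        # raw logits for CrossEntropyLoss

model = BiLSTMClassifier(vocab_size=10000)
criterion = nn.CrossEntropyLoss()   # expects raw logits, not softmax outputs
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)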
The output shape of hidden is (num_directions*num_layers, batch_size, hidden_size), which means you have to be careful with indexing when using a Bi-LSTM/GRU with multiple layers. I would suggest first separating num_layers and num_directions.
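To make the shape concrete, here is a small self-contained check with toy sizes (all numbers below are just for illustration):

import torch
import torch.nn as nn

num_layers, batch_size, hidden_size, seq_len, input_size = 2, 4, 8, 5, 6

lstm = nn.LSTM(input_size, hidden_size, num_layers=num_layers,
               bidirectional=True, batch_first=True)
x = torch.randn(batch_size, seq_len, input_size)
output, (h_n, c_n) = lstm(x)

print(h_n.shape)   # torch.Size([4, 4, 8]) = (num_directions*num_layers, batch_size, hidden_size)

# Separate layers and directions before indexing
h_n = h_n.view(num_layers, 2, batch_size, hidden_size)
last_layer = h_n[-1]   # (2, batch_size, hidden_size): forward and backward states of the last layer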
Here’s a snippet of my own code. It’s a bit verbose since I support LSTM and GRU as well as unidirectional and bidirectional.
# Push through RNN layer
rnn_output, self.hidden = self.rnn(X, self.hidden)

# Extract last hidden state
if self.params.rnn_type == RnnType.GRU:
    final_state = self.hidden.view(self.params.num_layers, self.num_directions, batch_size, self.params.rnn_hidden_dim)[-1]
elif self.params.rnn_type == RnnType.LSTM:
    final_state = self.hidden[0].view(self.params.num_layers, self.num_directions, batch_size, self.params.rnn_hidden_dim)[-1]

# Handle directions
final_hidden_state = None
if self.num_directions == 1:
    final_hidden_state = final_state.squeeze(0)
elif self.num_directions == 2:
    h_1, h_2 = final_state[0], final_state[1]
    # final_hidden_state = h_1 + h_2                 # Add both states (requires changes to the input size of the first linear layer + attention layer)
    final_hidden_state = torch.cat((h_1, h_2), 1)    # Concatenate both states
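Note that concatenating doubles the feature dimension, so the first linear layer after the RNN needs an input size of 2*hidden_dim, while adding keeps it at hidden_dim. A toy sketch of the two options (sizes and layer names are just for illustration):

import torch
import torch.nn as nn

hidden_dim, batch_size, num_classes = 512, 4, 4   # toy sizes

h_1 = torch.randn(batch_size, hidden_dim)   # forward final state
h_2 = torch.randn(batch_size, hidden_dim)   # backward final state

# Option 1: concatenate -> downstream layer needs 2 * hidden_dim inputs
fc_cat = nn.Linear(2 * hidden_dim, num_classes)
logits_cat = fc_cat(torch.cat((h_1, h_2), dim=1))

# Option 2: add -> downstream layer keeps hidden_dim inputs
fc_sum = nn.Linear(hidden_dim, num_classes)
logits_sum = fc_sum(h_1 + h_2)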
Hi Chris @vdw,
Thank you for your suggestion. Since I didn't quite get the whole idea, could you be so kind as to point me to an implementation of your Bi-LSTM/GRU model on a multiclass text classification task? Maybe I overlooked it, but I could not find it at the GitHub link.
Well, the code for the model is all in this file I already linked to in the previous post. An example usage is then as follows:
import torch

from pytorch.models.text.classifier.rnn import RnnClassifier, RnnType, AttentionModel, Parameters

# Check if a GPU is available
use_cuda = torch.cuda.is_available()
device = torch.device("cuda:0" if use_cuda else "cpu")

# Configure the RNN model
params = {'rnn_type': RnnType.LSTM,
          'rnn_hidden_dim': 512,
          'num_layers': 2,
          'bidirectional': True,
          'dropout': 0.2,
          'vocab_size': max_idx + 1,
          'embed_dim': 300,
          'linear_dims': [200, 100],
          'label_size': len(label_set),
          'clip': 0.5,
          'attention_model': AttentionModel.DOT}

params = Parameters(params)
model = RnnClassifier(device, params)
And then I use the model in my training loop as usual (a minimal sketch of such a loop follows the comments below). Just some comments:
the Parameters class is just for convenience; it converts the dictionary of parameters into a class with all parameters as class variables.
max_idx here is the largest index in my word list, making max_idx+1 the size of the vocabulary
the configuration example above creates a 2-layer Bi-LSTM with a 512-dim hidden representation; the word embedding size is 300. The output of the last Bi-LSTM layer (i.e., the last hidden state of the sequence) is pushed through 2 linear layers of size 200 and 100 (cf. linear_dims), before finally being pushed through a linear layer of the output size.
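A rough sketch of such a training loop, in case it helps; num_epochs, train_loader, and the learning rate are placeholders, and the exact forward call depends on the model class:

import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
num_epochs = 10   # placeholder

model.train()
for epoch in range(num_epochs):
    for inputs, labels in train_loader:   # train_loader: any DataLoader yielding (inputs, labels) batches
        inputs, labels = inputs.to(device), labels.to(device)
        optimizer.zero_grad()
        logits = model(inputs)            # exact signature depends on the model's forward()
        loss = criterion(logits, labels)
        loss.backward()
        # Gradient clipping, matching the 'clip' parameter above (skip if the model handles it internally)
        nn.utils.clip_grad_norm_(model.parameters(), params.clip)
        optimizer.step()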