How to use pack_padded_sequence correctly? How to compute the loss?

I’m using a very simple RNN-based binary classifier for short text documents. As far as I can tell, it works reasonably fine. The loss goes down nicely and the accuracy goes up to over 80% (it plateaus after 30-40 epochs; I’m doing 100). The forward method of the classifier looks like this – the input batch X is sorted by sequence length, but I don’t utilize that here:

def forward(self, X_sorted, X_length_sorted, method='last_step'):
    X = self.word_embeddings(X_sorted)         # (batch, seq_len, embed_dim)
    X = torch.transpose(X, 0, 1)               # (seq_len, batch, embed_dim)
    X, self.hidden = self.gru(X, self.hidden)  # (seq_len, batch, num_directions * hidden_dim)
    X = X[-1]                                  # output of the last time step: (batch, num_directions * hidden_dim)
    # A series of fully connected layers
    for l in self.linears:
        X = l(X)
    return F.log_softmax(X, dim=1)

Naturally, the length of the sequences varies (with a minimum length of 5). Now I wanted to see how the packing and padding of sequences works. I therefore modified the forward method as follows:

def forward(self, X_sorted, X_length_sorted, method='last_step'):
    X = self.word_embeddings(X_sorted)
    X = torch.transpose(X, 0, 1)
    X = nn.utils.rnn.pack_padded_sequence(X, X_length_sorted)  # pack so the GRU skips padded positions
    X, self.hidden = self.gru(X, self.hidden)
    X, output_lengths = nn.utils.rnn.pad_packed_sequence(X)    # pad back to the longest length in the batch
    X = X[-1]                                                  # last time step of the padded output
    # A series of fully connected layers
    for l in self.linears:
        X = l(X)
    return F.log_softmax(X, dim=1)

The network still trains, but I’ve noticed some differences:

  • Each epoch takes about 10-15% longer to process
  • The loss goes down much slower (using the same learning rate)
  • The accuracy goes up to only about 70% (it plateaus after 30-40 epochs, I’m doing 100)

I also tried changing nn.NLLLoss() to nn.NLLLoss(ignore_index=0), with 0 being the padding index. Again, it trains, but the loss goes down suspiciously fast (even with a much smaller learning rate) and the accuracy doesn’t change at all. I still somehow feel that the calculation of the loss is the issue.
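To check what ignore_index actually does here, I put together a tiny standalone sketch with made-up numbers – as far as I understand the docs, ignore_index filters on the target values (i.e. the class labels), not on padding tokens inside the input sequences, which might explain the odd behavior, but I’m not sure:

import torch
import torch.nn as nn
import torch.nn.functional as F

logits = torch.randn(4, 2)            # 4 examples, 2 classes (made-up numbers)
log_probs = F.log_softmax(logits, dim=1)
targets = torch.tensor([0, 1, 0, 1])  # class labels, not padding tokens

plain = nn.NLLLoss()
ignoring = nn.NLLLoss(ignore_index=0)

# ignore_index filters on *target values*: with ignore_index=0 every class-0
# example is dropped from the loss, only the class-1 examples remain
print(plain(log_probs, targets))
print(ignoring(log_probs, targets))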

In short, it kind of works in the sense that the network trains, but I fail to properly interpret the results. Am I missing something here, or are these the expected results?

If you are concerned about model underfitting, you should try to overfit your model on a small mini-batch and see if the accuracy goes to 100%. If the model is not able to overfit, then you are underfitting.
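A rough sketch of what I mean (model, batch_X, batch_lengths and batch_y are placeholders for your own objects, and init_hidden is a hypothetical helper in case you keep the hidden state on the module):

import torch
import torch.nn as nn

criterion = nn.NLLLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Train repeatedly on ONE small batch; a healthy model/loss pipeline
# should reach 100% training accuracy on it fairly quickly.
for step in range(500):
    optimizer.zero_grad()
    model.hidden = model.init_hidden(batch_X.shape[0])  # hypothetical helper to reset the hidden state
    log_probs = model(batch_X, batch_lengths)
    loss = criterion(log_probs, batch_y)
    loss.backward()
    optimizer.step()

    acc = (log_probs.argmax(dim=1) == batch_y).float().mean().item()
    if acc == 1.0:
        print('overfit the batch after', step, 'steps')
        break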

Thanks! Using your idea I was able to drill down to the problem. With a very small dataset, I could immediately overfit the model (training accuracy = 100%) when I didn’t use packing – initially, that didn’t happen when I used packing. I finally got it to work with packing when I used a batch size of 1.

I’m pretty sure now that I cannot simply use packing and X = X[-1] to get the last output. When the batch size is 1, the GRU output dimension is (seq_len, batch_size, dim) where seq_len is just the length of the sequence without padding. With larger batches, seq_len is the length of the longest sequence in the batch. So when I do X = X[-1], I get padding (zeros) instead of the last real output for all shorter sequences. I could confirm this by making sure that all the sequences in my mini dataset have no padding – then I could overfit my model even with packing.
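To illustrate this, here is a tiny standalone example with made-up dimensions – after pad_packed_sequence, the time steps beyond a sequence’s real length are just the padding value (zeros), while the true final state of the shorter sequence is only available in the hidden state:

import torch
import torch.nn as nn

gru = nn.GRU(input_size=4, hidden_size=3)  # (seq_len, batch, feature) layout

# Two sequences: lengths 5 and 2, padded to 5 (made-up data)
batch = torch.randn(5, 2, 4)
batch[2:, 1, :] = 0.0           # padding positions of the shorter sequence
lengths = torch.tensor([5, 2])  # sorted in decreasing order

packed = nn.utils.rnn.pack_padded_sequence(batch, lengths)
packed_out, hidden = gru(packed)
out, out_lengths = nn.utils.rnn.pad_packed_sequence(packed_out)

print(out[-1, 0])    # last real output of the long sequence
print(out[-1, 1])    # all zeros: padding, not the short sequence's last output
print(hidden[0, 1])  # the short sequence's true final state lives in the hidden state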

My current solution is therefore to use not the last output but the final hidden state of the RNN. For this, I used the approach outlined here. I’m not 100% sure if this is the (most) correct way, but now I can train my model on the original dataset with packing and get the expected test accuracy of about 80%.

After consulting the PyTorch docs a bit longer and looking at some other code examples, I’m posting my current forward function below. Maybe it’s useful for some people; I actually haven’t found that many examples for this.

def forward(self, X_sorted, X_length_sorted, method='last_step'):
    X = self.word_embeddings(X_sorted)
    X = torch.transpose(X, 0, 1)
    X = nn.utils.rnn.pack_padded_sequence(X, X_length_sorted)
    X, self.hidden = self.gru(X, self.hidden)
    X, output_lengths = nn.utils.rnn.pad_packed_sequence(X)
    # self.hidden: (num_layers * num_directions, batch, hidden_dim)
    # -> reshape and take [-1] to get the state(s) of the last layer
    final_state = self.hidden.view(self.num_layers, self.directions_count, X_sorted.shape[0], self.rnn_hidden_dim)[-1]
    if self.directions_count == 1:
        X = final_state.squeeze(0)    # (batch, hidden_dim); squeeze(0) keeps the batch dim even for batch size 1
    elif self.directions_count == 2:
        h_1, h_2 = final_state[0], final_state[1]  # forward & backward pass
        #X = h_1 + h_2                # Add both states (keeps hidden_dim)
        X = torch.cat((h_1, h_2), 1)  # Concatenate both states (doubles hidden_dim)
    # A series of fully connected layers
    for l in self.linears:
        X = l(X)
    return F.log_softmax(X, dim=1)

Of course, the size of the first linear layer depends on whether I sum or concatenate the hidden states in the case of a bidirectional RNN. With my simple dataset, both approaches currently work equally well, but I don’t know if one approach is generally preferable.
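Just to illustrate that last point (made-up sizes; concat_directions is only a flag I use for this sketch): summing keeps the input size of the first linear layer at the hidden dimension, while concatenating doubles it.

import torch.nn as nn

rnn_hidden_dim = 64        # made-up size
num_classes = 2
bidirectional = True
concat_directions = True   # False would mean summing h_1 + h_2 instead

# Concatenating the two directional states doubles the input size of the first linear layer
first_in = 2 * rnn_hidden_dim if (bidirectional and concat_directions) else rnn_hidden_dim
linears = nn.ModuleList([
    nn.Linear(first_in, 32),
    nn.Linear(32, num_classes),
])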

To recap the original problem: when using a PackedSequence, one cannot simply use the last output of the RNN (in my code, X = X[-1]), since the first dimension of X after pad_packed_sequence is the length of the longest sequence in the batch. For shorter sequences, the RNN does not go that far, so X[-1] contains only padding for them.
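If one wants to stay with the outputs instead of the hidden state, the lengths returned by pad_packed_sequence can be used to pick each sequence’s last valid time step instead of blindly taking index -1. A sketch, assuming the (seq_len, batch, dim) layout from my forward function:

# X: (seq_len, batch, dim) after pad_packed_sequence; output_lengths: true lengths
last_step = (output_lengths - 1).to(X.device)          # index of the last valid time step per sequence
batch_idx = torch.arange(X.shape[1], device=X.device)
last_outputs = X[last_step, batch_idx]                 # (batch, dim), no padding involved

As far as I can tell, for a bidirectional RNN the backward direction has only seen one token at that position, so I stick with the hidden-state approach above.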


@vdw thanks for sharing your piece of code, it helped me a lot!