How to use pack_padded_sequence correctly? How to compute the loss?

I’m using a very simple RNN-based binary classifier for short text documents. As far as I can tell, it works reasonably fine. The loss goes down nicely and the accuracy goes up over 80% (it plateaus after 30-40 epochs; I’m doing 100). The forward method of the classifier looks like this – the input batch X is sorted w.r.t. their length, but I don’t utilize that here:

def forward(self, X_sorted, X_length_sorted, method='last_step'):
    X = self.word_embeddings(X_sorted)
    X = torch.transpose(X, 0, 1)  # (batch, seq_len, dim) -> (seq_len, batch, dim)
    X, self.hidden = self.gru(X, self.hidden)
    X = X[-1]                     # output of the last time step
    # A series of fully connected layers
    for l in self.linears:
        X = l(X)
    return F.log_softmax(X, dim=1)

Naturally, the length of the sequences can vary, from a minimum length of 5 up to some maximum. Now I wanted to see how the packing and padding of the sequences works. I therefore modified the forward method as follows:

def forward(self, X_sorted, X_length_sorted, method='last_step'):
    X = self.word_embeddings(X_sorted)
    X = torch.transpose(X, 0, 1) 
    X = nn.utils.rnn.pack_padded_sequence(X, X_length_sorted)
    X, self.hidden = self.gru(X, self.hidden)
    X, output_lengths = nn.utils.rnn.pad_packed_sequence(X)
    X = X[-1]
    # A series of fully connected layers
    for l in self.linears:
        X = l(X)
    return F.log_softmax(X, dim=1)

The network still trains, but I’ve noticed some differences:

  • Each epoch takes about 10-15% longer to process
  • The loss goes down much slower (using the same learning rate)
  • The accuracy goes up to only about 70% (it plateaus after 30-40 epochs, I’m doing 100)

I also tried changing nn.NLLLoss() to nn.NLLLoss(ignore_index=0), with 0 being the padding index. Again, it trains, but the loss goes down almost crazily fast (even with a much smaller learning rate) and the accuracy doesn’t change at all. I still somehow feel that the calculation of the loss is the issue.
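
As a side note on ignore_index: targets whose value equals ignore_index are simply excluded from the loss average. A minimal sketch with made-up tensors (assuming binary class labels 0/1 as targets, as in my setup):

import torch
import torch.nn as nn

# Hypothetical log-probabilities for a batch of 4 documents and 2 classes
log_probs = torch.log_softmax(torch.randn(4, 2), dim=1)
targets = torch.tensor([0, 1, 0, 1])  # class labels, not padded token indices

loss_all = nn.NLLLoss()(log_probs, targets)
loss_ignored = nn.NLLLoss(ignore_index=0)(log_probs, targets)

# With ignore_index=0, every sample whose label is 0 is dropped from the
# average, so the two values differ.
print(loss_all.item(), loss_ignored.item())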

In short, it kind of works in the sense that the network trains, but I fail to properly interpret the results. Am I missing something here, or are these the expected results?

If you are concerned about model underfitting, you should try to overfit your model on a small mini-batch and see if the accuracy goes to 100%. If the model is not able to overfit, then you are underfitting.
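
A minimal sketch of that check; model, criterion, optimizer and the batch tensors are placeholders for your own objects:

def overfit_one_batch(model, criterion, optimizer, X_batch, lengths, y_batch, steps=200):
    # Sanity check: train repeatedly on one fixed mini-batch. If the accuracy
    # does not approach 100%, the model or the loss computation is suspect.
    model.train()
    for step in range(steps):
        optimizer.zero_grad()
        log_probs = model(X_batch, lengths)
        loss = criterion(log_probs, y_batch)
        loss.backward()
        optimizer.step()
        acc = (log_probs.argmax(dim=1) == y_batch).float().mean().item()
        if step % 50 == 0:
            print(f"step {step}: loss={loss.item():.4f}  acc={acc:.2f}")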

Thanks! Using your idea I was able to drill down to the problem. Using a very small dataset, I could immediately overfit the model (training accuracy = 100%) if I don’t use packing – initially, that didn’t happen when I used packing. I finally got it to work with packing when I used a batch size of 1.

I’m pretty sure now that I cannot simply use packing and X = X[-1] to get the last output. When the batch size is 1, the GRU output dimension is (seq_len, batch_size, dim), where seq_len is just the length of the sequence without padding. With larger batches, seq_len is the length of the longest sequence in the batch. So when I do X = X[-1], I get meaningless output for all shorter sequences that have padding. I could confirm this by making sure that all sequences in my mini dataset have no padding – then I could overfit my model even with packing.

My current solution is therefore to use not the last output but the final hidden state of the RNN. For this, I used the approach outlined here. I’m not 100% sure if this is the (most) correct way, but now I can train my model on the original dataset with packing and get the expected test accuracy of 80%.


After consulting the PyTorch docs a bit longer and seeing some other code examples, I post my current forward function below. Maybe it’s useful for some people; I actually haven’t found that many examples for this.

def forward(self, X_sorted, X_length_sorted, method='last_step'):
    X = self.word_embeddings(X_sorted)
    X = torch.transpose(X, 0, 1)
    X = nn.utils.rnn.pack_padded_sequence(X, X_length_sorted)
    X, self.hidden = self.gru(X, self.hidden)
    X, output_lengths = nn.utils.rnn.pad_packed_sequence(X)
    # Reshape the hidden state to (num_layers, num_directions, batch, hidden_dim)
    # and keep only the last layer
    final_state = self.hidden.view(self.num_layers, self.directions_count, X_sorted.shape[0], self.rnn_hidden_dim)[-1]
    if self.directions_count == 1:
        X = final_state.squeeze(0)  # (1, batch, hidden_dim) -> (batch, hidden_dim)
    elif self.directions_count == 2:
        h_1, h_2 = final_state[0], final_state[1]  # forward & backward pass
        #X = h_1 + h_2                # Add both states
        X = torch.cat((h_1, h_2), 1)  # Concatenate both states
    # A series of fully connected layers
    for l in self.linears:
        X = l(X)
    return F.log_softmax(X, dim=1)

Of course, the size of the first linear layer depends on whether I sum or concatenate the hidden states in the case of a bidirectional RNN. Using my simple dataset, both approaches currently work equally well, but I don’t know if one approach is generally preferable.
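
For reference, only the input size of that first layer changes; a minimal sketch (rnn_hidden_dim, fc_out_dim and combine are hypothetical names):

import torch.nn as nn

# Summing the two directions keeps the hidden dimension, concatenating doubles it
rnn_hidden_dim, fc_out_dim, combine = 128, 64, 'concat'
fc_in_dim = rnn_hidden_dim if combine == 'sum' else 2 * rnn_hidden_dim
first_linear = nn.Linear(fc_in_dim, fc_out_dim)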

To recap the original problem: when using PackedSequence, one cannot simply use the last output of the RNN (in my code, X = X[-1]), since the first dimension of X after pad_packed_sequence is the length of the longest sequence in the batch. For shorter sequences, the RNN output does not go that far.
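
For completeness, if one does want to use the RNN output (rather than the final hidden state) with packing, the last valid time step of each sequence can be gathered using the lengths returned by pad_packed_sequence. A minimal sketch, assuming the (seq_len, batch, dim) layout from my forward method:

import torch

def last_valid_outputs(X, output_lengths):
    # X: (seq_len, batch, dim) padded output of pad_packed_sequence
    # output_lengths: (batch,) true lengths of the sequences
    idx = (output_lengths - 1).to(X.device)                     # last valid time step per sequence
    return X[idx, torch.arange(X.size(1), device=X.device), :]  # (batch, dim)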


@vdw thanks for sharing your piece of code, it helped me a lot!

Hi Chris @vdw,

I’m having issues similar to yours with packed sequences, and I was wondering if you could help me. I’m working on a very simple RNN model and I’ve got variable-length sentences as input. Every example that I have seen in the past uses nn.RNN, nn.GRU or nn.LSTM; however, I’m defining my own model, so I don’t know how to use the packed sequence. Below is the relevant part of my code:

n_hidden = 200

class RNN(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(RNN, self).__init__()

        self.hidden_size = hidden_size
        self.x2h = nn.Linear(input_size, hidden_size)   # input -> hidden
        self.h2h = nn.Linear(hidden_size, hidden_size)  # hidden -> hidden
        self.h2o = nn.Linear(hidden_size, output_size)  # hidden -> output
        self.softmax = nn.Softmax(dim=1)

    def forward(self, input, hidden_):
        hidden1 = self.x2h(input)
        hidden2 = self.h2h(hidden_)
        hidden = hidden1 + hidden2
        output = self.h2o(hidden)
        output = self.softmax(output)
        return output, hidden

    def initHidden(self):
        return torch.zeros(1, self.hidden_size)

rnn1 = RNN(n_vocab, n_hidden, n_vocab)

xpacked = torch.nn.utils.rnn.pack_padded_sequence(x, lengths, batch_first=True, enforce_sorted=False)

h = rnn1.initHidden()
output, hidden = rnn1(xpacked, h)

This code throws an error, and I suspect that it’s because I’m trying to pass all the words at once. The problem is that I don’t know how to access each word in the packed sequence while still exploiting the advantages of the function.
My train function, which was working before packing, is as follows:

def train(text_x_tensor1,label1):#, text_x_tensor2, label2):
    text_x_tensor1, label1 = text_x_tensor1.to(device), label1.to(device)
    rnn1.train()
    hidden_1 = rnn1.initHidden()
    hidden_1 = hidden_1.to(device)
    text_x_tensor1 = text_x_tensor1.permute(1,0,2)
    for i in range(len(text_x_tensor1)): #For each word
        output_1, hidden_1 = rnn1(text_x_tensor1[i], hidden_1) 
    loss1 = criterion(output_1,label1)
    optimizer1.zero_grad()
    loss1.backward()
    torch.nn.utils.clip_grad_norm_(rnn1.parameters(),1)
    optimizer1.step()
    return output_1, loss1,hidden_1

Hoping you can help me,
Marco

@marco_zaror I don’t think I can really help here, since it requires insights into the internals. I’m just an occasional user for my research work.

The problem is that you define your own RNN but are using a PyTorch data structure, PackedSequence, that is arguably designed to work well with nn.LSTM and nn.GRU. Sure, in principle, it should be possible to use it in a custom fashion, but I have no idea how. The question is also whether it’s worth the effort of re-inventing the wheel – I understand, of course, that you’re (partly) doing this for education/understanding.

To be honest, I would ignore that issue. Just use the BucketIterator, which creates batches where all sequences within a batch have the same or at least very similar length. Even if there’s padding, it’s minimal, so it arguably won’t have any negative effects. Or enforce batches with sequences of equal length; see this thread.
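
A rough sketch of the BucketIterator approach, assuming the (legacy) torchtext API and a torchtext dataset train_data with a text field:

from torchtext.data import BucketIterator   # torchtext.legacy.data in newer versions

train_iter = BucketIterator(
    train_data,                         # hypothetical torchtext Dataset
    batch_size=32,
    sort_key=lambda ex: len(ex.text),   # put examples of similar length into the same batch
    sort_within_batch=True,             # required if you still call pack_padded_sequence
    shuffle=True,
)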

Thank you very much for your answer, Chris @vdw. Your idea is actually easier, so thank you for that!
If you don’t mind, can I ask you one more question? I’ve been reading a lot about BPTT and now I’m confused about something else. As can be seen in my train function, I compute the forward pass of my model n times (where n is the number of words in the sentence). After that, I only compute the loss (and back propagate) for the last word of the sentence – is that OK in your opinion?
I’ve seen math explanations, and they imply that I should compute the loss for every word in the sentence and then average that loss (in other words, take word 1 and predict word 2, then take words 1 and 2 and predict word 3, and so on), but I’m not sure about that approach.

Can I have your thoughts on it? I’m really sorry if the question is too basic…

Disclaimer: The following are just my thoughts. I’m neither an expert in deep learning nor in PyTorch. It’s just one area related to my research work.

I only compute the loss (and back propagate) for the last word of the sentence

I’m rather sure that you backpropagate through the whole sequence. The for loop in your train() method builds up the computation graph over all the words, so when you call backward() it should consider the whole graph.
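
A tiny self-contained sketch (with made-up sizes, not your exact model) that illustrates this:

import torch
import torch.nn as nn

x2h, h2h = nn.Linear(4, 3), nn.Linear(3, 3)
h = torch.zeros(1, 3)
steps = [torch.randn(1, 4) for _ in range(5)]

for x in steps:                        # same pattern as the loop in train()
    h = torch.tanh(x2h(x) + h2h(h))

loss = h.sum()                         # loss computed only on the last step
loss.backward()
print(x2h.weight.grad.abs().sum())     # non-zero: the gradient flowed through all 5 steps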

I’ve seen math explanations, […]

I’m not sure what you mean by this last paragraph. You use the RNN for classification, so there’s no notion of “take word 1 and predict word 2, then take words 1 and 2 and predict word 3, and so on”. This sounds more like sentence generation with some kind of decoder architecture.


That’s exactly what I was thinking. I really appreciate your thoughts Chris @vdw, thank you very much!

Thank you for your post!
So far, I have failed to find a full example of training a recurrent net using pack_padded_sequence. I was wondering if there is anything we need to do in the backward step, or if it remains the same as it would be without packing.