Why do we need to pack padded batches of sequences in Pytorch?

I was going through the Chatbot tutorial and saw the following code:

    def forward(self, input_seq, input_lengths, hidden=None):
        '''
        input_seq : batch of word indices, shape (max_length, batch_size)

        Computation Graph:

            Convert word indexes to embeddings.
            Pack padded batch of sequences for RNN module.
            Forward pass through GRU.
            Unpack padding.
            Sum bidirectional GRU outputs.
            Return output and final hidden state.
        '''
        # Convert word indexes (10,64) to embeddings (10, 64, 500)
        embedded = self.embedding(input_seq) # (10,64,500) = (max_len, batch_size, embedding_dim=hidden_dim)
        # Pack padded batch of sequences for RNN module
        packed = nn.utils.rnn.pack_padded_sequence(embedded, input_lengths)
        # Forward pass through GRU
        outputs, hidden = self.gru(packed, hidden)
        # Unpack padding
        outputs, _ = nn.utils.rnn.pad_packed_sequence(outputs)
        # Sum bidirectional GRU outputs
        outputs = outputs[:, :, :self.hidden_size] + outputs[:, :, self.hidden_size:]
        # Return output and final hidden state
        return outputs, hidden

I don’t understand what the purpose of packing is. I suspect the real issue is that I don’t understand the problem these functions are trying to solve, so I am going to state what I think that problem is; hopefully this helps other people understand what’s going on, and someone can correct me if I’m wrong:

Problem statement: efficiently process batches of variable-length sequences in PyTorch (rather than looping over each sequence manually)
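
To make the problem concrete, here is a toy padded batch I put together myself (the sequences, word indices, and sizes are made up, not from the tutorial):

    import torch
    from torch.nn.utils.rnn import pad_sequence

    # three toy sequences of word indices with different lengths
    seqs = [torch.tensor([4, 7, 2]), torch.tensor([5, 1]), torch.tensor([8])]
    padded = pad_sequence(seqs)  # zero-pad into a rectangular (max_len=3, batch_size=3) tensor
    print(padded)
    # tensor([[4, 5, 8],
    #         [7, 1, 0],
    #         [2, 0, 0]])
    lengths = torch.tensor([3, 2, 1])  # the true lengths we have to carry around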

This is my understanding of what is going on and how these functions solve it: we have to process batches of sequences of DIFFERENT lengths, but we don’t want the padding to be processed by our RNN modules, so we need to tell the RNN where each sequence really ends so it can process the batch properly. This should be better than looping over each sequence in the batch ourselves in Python, because PyTorch can apply optimizations (especially on GPUs) that a sequential Python loop can’t, and writing that parallelization ourselves seems silly.
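
The “manual” alternative I have in mind looks something like this (my own sketch, with toy sizes, of what packing presumably lets us avoid):

    import torch
    import torch.nn as nn

    embedding = nn.Embedding(10, 6)  # toy vocab of 10 words, embedding dim 6
    gru = nn.GRU(6, 6)
    seqs = [torch.tensor([4, 7, 2]), torch.tensor([5, 1]), torch.tensor([8])]

    # run the GRU on each sequence separately: no padding needed, but no batching either,
    # so every sequence is its own (slow) forward pass driven from Python
    per_seq_outputs = []
    for seq in seqs:
        emb = embedding(seq).unsqueeze(1)  # (seq_len, batch=1, 6)
        out, _ = gru(emb)
        per_seq_outputs.append(out)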

So we pack the (zero-)padded sequences, and the packing tells PyTorch how long each sequence really is, so that when the RNN module (say a GRU or LSTM) receives the batch it doesn’t process the meaningless padding (the padding is only there so that the batch can be a rectangular tensor, since we can’t have “tensors where each row has a different length”).
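
To check that this is actually what packing does, I tried the following toy sketch (same made-up batch as above; the sizes are my own, not the tutorial’s):

    import torch
    import torch.nn as nn
    from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence, pad_packed_sequence

    seqs = [torch.tensor([4, 7, 2]), torch.tensor([5, 1]), torch.tensor([8])]
    padded = pad_sequence(seqs)        # (max_len=3, batch_size=3), zero-padded
    lengths = torch.tensor([3, 2, 1])  # must be sorted in decreasing order here

    embedding = nn.Embedding(10, 6)
    gru = nn.GRU(6, 6)

    embedded = embedding(padded)                      # (3, 3, 6)
    packed = pack_padded_sequence(embedded, lengths)
    print(packed.batch_sizes)                         # tensor([3, 2, 1])
    outputs, hidden = gru(packed)                     # the GRU only does work for the non-padded positions
    outputs, out_lengths = pad_packed_sequence(outputs)  # back to a zero-padded (3, 3, 6) tensor
    print(out_lengths)                                # tensor([3, 2, 1])

If I read `batch_sizes` correctly, the GRU processes 3 rows at step 0, 2 rows at step 1 and 1 row at step 2, which would be exactly “skipping the padding”.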

Is this correct? Is this why we need packing?


Crossposted:


Resources I’ve read to understand this issue:
