Variable length sequences and pack_padded_sequence

  1. I see a >5x slowdown on the .backward() call when using pack_padded_sequence (~80s instead of ~14s). Is this expected?
  2. I am currently using a stack of two bidirectional LSTMs; what is the best practice for retrieving the final states when running on variable-length inputs? I did this (not sure if it is the correct or fastest approach; a fuller runnable sketch follows the snippet):
rnn_out, (ht, ct) = lstm_layer(lstm_input)  # ht: (num_layers * num_directions, batch_size, hidden_size)
reshaped_hidden = ht.view(num_lstm_layers, 2, batch_size, hidden_size)  # split the layer and direction dimensions
back_forward_concat = torch.cat([reshaped_hidden[-1, 0, :, :], reshaped_hidden[-1, 1, :, :]], dim=1)  # last layer: forward state concatenated with backward state
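
For context, here is the same thing as a minimal, self-contained sketch; the sizes, random inputs, and lengths are made up and just stand in for my real model:

import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence

batch_size, max_len, input_size, hidden_size, num_lstm_layers = 4, 10, 8, 16, 2
lstm_layer = nn.LSTM(input_size, hidden_size, num_layers=num_lstm_layers,
                     bidirectional=True, batch_first=True)

padded = torch.randn(batch_size, max_len, input_size)  # (batch, seq, features)
lengths = torch.tensor([10, 7, 5, 3])                  # true length of each sequence

# Pack so the LSTM only processes the valid part of each sequence.
lstm_input = pack_padded_sequence(padded, lengths, batch_first=True, enforce_sorted=False)

rnn_out, (ht, ct) = lstm_layer(lstm_input)
# ht: (num_layers * num_directions, batch, hidden) -> separate layers and directions.
reshaped_hidden = ht.view(num_lstm_layers, 2, batch_size, hidden_size)
# Final state of the last layer: forward direction concatenated with backward direction.
back_forward_concat = torch.cat(
    [reshaped_hidden[-1, 0, :, :], reshaped_hidden[-1, 1, :, :]], dim=1)
print(back_forward_concat.shape)  # torch.Size([4, 32]) -> (batch_size, 2 * hidden_size)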

Thanks!

I ran torch.utils.bottleneck to try to identify the cause, but I am not sure whether the output is informative enough:

Without pack_padded_sequence:

With pack_padded_sequence:

Could you post code snippets pertaining to each of the two runs? It is hard to tell what is going on without them.

        concatenated = torch.cat(transformed, dim=2)
        if True:
            packed = pack_padded_sequence(concatenated, input_lengths, batch_first=True, enforce_sorted=False)
            rnnout, _ = self.core_layers.lstm(packed)  # rnnout is a PackedSequence here
            unpacked = pad_packed_sequence(rnnout, batch_first=True)  # returns (padded output of shape (batch_size, seq_length, hidden_size*2), lengths)
            final_per_seq = unpacked[0][torch.arange(unpacked[0].size(0)), unpacked[1] - 1]  # last valid timestep of each sequence
            out = self.core_layers.fc(final_per_seq)
        else: # I am using this to debug runtime issues of padding
            concat_normed = self.core_layers.layernorm(concatenated)
            rnnout, _ = self.core_layers.lstm(concat_normed)  # out: tensor of shape (batch_size, seq_length, hidden_size*2)
            out = self.core_layers.fc(rnnout[:, -1, :])  # this is probably the wrong way to index variable length sequences.
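
(For the record, the lengths-based gather used in the packed branch would also be the fix for that else branch; a sketch with made-up shapes, since I only use that branch to time the padded path:)

import torch

batch_size, seq_length, feat = 4, 10, 32
rnn_output = torch.randn(batch_size, seq_length, feat)  # padded LSTM output, batch_first
lengths = torch.tensor([10, 7, 5, 3])                   # true length of each sequence

# Pick each sequence's last *valid* timestep; plain [:, -1, :] would read
# padding for every sequence shorter than seq_length.
last_valid = rnn_output[torch.arange(batch_size), lengths - 1]  # (batch_size, feat)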

I have verified that this problem does not reproduce on GPU; that is a sufficient workaround for me.
Thanks.

PackedSequence is definitely more efficient on CUDA because PyTorch calls into a specialized cuDNN kernel to perform that computation.
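
As a rough way to see the difference, you can time the same packed forward/backward pass on CPU and on GPU; a sketch with arbitrary sizes, assuming a CUDA device is available:

import time
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence

def timed_step(device):
    # Identical packed LSTM forward + backward; only the device changes.
    lstm = nn.LSTM(64, 128, num_layers=2, bidirectional=True, batch_first=True).to(device)
    x = torch.randn(32, 200, 64, device=device)
    lengths = torch.randint(50, 201, (32,))  # lengths stay on the CPU
    packed = pack_padded_sequence(x, lengths, batch_first=True, enforce_sorted=False)
    if device == "cuda":
        torch.cuda.synchronize()
    start = time.time()
    out, _ = lstm(packed)            # takes the cuDNN variable-length path on CUDA
    out.data.sum().backward()
    if device == "cuda":
        torch.cuda.synchronize()
    return time.time() - start

print("cpu :", timed_step("cpu"))
if torch.cuda.is_available():
    print("cuda:", timed_step("cuda"))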