- I see a >5x slowdown on the .backward() call when using pack_padded_sequence (~80s instead of ~14s). Is this expected?
- I am currently using a stack of 2 bidirectional LSTMs; what is the best practice for retrieving the final states when running on variable-length inputs? I did this (not sure whether it is the correct / fastest approach):
rnn_out, (ht, ct) = lstm_layer(lstm_input)
# ht: (num_layers * num_directions, batch, hidden) -> (num_layers, num_directions, batch, hidden)
reshaped_hidden = ht.view(num_lstm_layers, 2, batch_size, hidden_size)
# concatenate the forward and backward final states of the top layer -> (batch, 2 * hidden)
back_forward_concat = torch.cat([reshaped_hidden[-1, 0, :, :], reshaped_hidden[-1, 1, :, :]], dim=1)
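For what it's worth, when the input is packed, `h_n` already holds the hidden state at each sequence's true last step, so the view-and-concat pattern above works without any manual masking. A self-contained sketch (sizes and lengths are arbitrary placeholders, not the original model):

```python
import torch
from torch import nn
from torch.nn.utils.rnn import pack_padded_sequence

# Hypothetical sizes purely for illustration.
batch_size, max_len, input_size, hidden_size, num_lstm_layers = 4, 7, 8, 16, 2

lstm = nn.LSTM(input_size, hidden_size, num_layers=num_lstm_layers,
               batch_first=True, bidirectional=True)

x = torch.randn(batch_size, max_len, input_size)
lengths = torch.tensor([7, 5, 3, 2])

# With a packed input, ht is taken at each sequence's true last timestep.
packed = pack_padded_sequence(x, lengths, batch_first=True, enforce_sorted=False)
_, (ht, _) = lstm(packed)

# (num_layers * 2, batch, hidden) -> (num_layers, 2, batch, hidden)
ht = ht.view(num_lstm_layers, 2, batch_size, hidden_size)
# Final forward + backward states of the top layer -> (batch, 2 * hidden)
final = torch.cat([ht[-1, 0], ht[-1, 1]], dim=1)
```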
I ran torch.utils.bottleneck to try to identify the cause, but I am not sure its output is informative enough.
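A narrower per-op breakdown of just the suspect forward/backward pass can also be obtained with `torch.profiler` directly (a sketch with arbitrary small sizes, not the original model):

```python
import torch
from torch import nn
from torch.nn.utils.rnn import pack_padded_sequence

# Small, arbitrary sizes purely for illustration.
lstm = nn.LSTM(8, 16, batch_first=True, bidirectional=True)
x = torch.randn(4, 10, 8)
lengths = torch.tensor([10, 8, 5, 3])

with torch.profiler.profile() as prof:
    packed = pack_padded_sequence(x, lengths, batch_first=True,
                                  enforce_sorted=False)
    out, _ = lstm(packed)
    out.data.sum().backward()

# Aggregate per-op CPU time; the lstm / packing ops should dominate.
table = prof.key_averages().table(sort_by="cpu_time_total", row_limit=10)
print(table)
```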
Could you post code snippets pertaining to each of the two runs? It is hard to tell what is going on without them.
concatenated = torch.cat(transformed, dim=2)
packed = pack_padded_sequence(concatenated, input_lengths, batch_first=True, enforce_sorted=False)
rnnout, _ = self.core_layers.lstm(packed) # rnnout is a PackedSequence here, not a (batch_size, seq_length, hidden_size*2) tensor
unpacked, lengths = pad_packed_sequence(rnnout, batch_first=True) # (batch_size, seq_length, hidden_size*2), (batch_size,)
final_per_seq = unpacked[torch.arange(unpacked.size(0)), lengths - 1]
out = self.core_layers.fc(final_per_seq)
else: # I am using this to debug runtime issues of padding
concat_normed = self.core_layers.layernorm(concatenated)
rnnout, _ = self.core_layers.lstm(concat_normed) # out: tensor of shape (batch_size, seq_length, hidden_size*2)
out = self.core_layers.fc(rnnout[:, -1, :]) # wrong for variable-length inputs: index -1 picks up padding for shorter sequences
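If you do want to index the padded output at each sequence's last valid step (e.g. in the non-packed branch), a gather-based helper along these lines works; the function name and sizes are mine, not from the post. Note that for a bidirectional LSTM the backward half of that timestep has only seen one token, so `h_n` is usually the better source for the backward final state.

```python
import torch

def last_valid_step(padded, lengths):
    """Select the output at each sequence's last valid timestep.

    padded:  (batch, max_len, features), zero-padded
    lengths: (batch,) true sequence lengths
    """
    # Build per-batch indices of shape (batch, 1, features) for gather.
    idx = (lengths - 1).view(-1, 1, 1).expand(-1, 1, padded.size(2))
    return padded.gather(1, idx).squeeze(1)  # (batch, features)

# Tiny check: two sequences with true lengths 3 and 1.
out = torch.arange(12, dtype=torch.float).view(2, 3, 2)
result = last_valid_step(out, torch.tensor([3, 1]))
```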
I have verified that this problem does not reproduce on GPU, so running on GPU is a sufficient workaround for me.
PackedSequence is definitely more efficient on CUDA because PyTorch calls into a dedicated cuDNN kernel for that computation.
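The packed-vs-padded cost can be compared with a self-contained micro-benchmark along these lines (sizes and iteration count are arbitrary placeholders, not the original workload, so the absolute numbers will not match the post):

```python
import time
import torch
from torch import nn
from torch.nn.utils.rnn import pack_padded_sequence

def time_backward(use_packing, iters=3):
    """Roughly time forward + backward on CPU, with or without packing."""
    lstm = nn.LSTM(32, 64, num_layers=2, batch_first=True, bidirectional=True)
    x = torch.randn(16, 50, 32)
    lengths = torch.randint(10, 51, (16,))
    start = time.perf_counter()
    for _ in range(iters):
        inp = (pack_padded_sequence(x, lengths, batch_first=True,
                                    enforce_sorted=False)
               if use_packing else x)
        out, _ = lstm(inp)
        # PackedSequence stores its flat tensor in .data
        data = out.data if use_packing else out
        data.sum().backward()
        lstm.zero_grad()
    return (time.perf_counter() - start) / iters

packed_t = time_backward(True)
padded_t = time_backward(False)
```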