PyTorch pack_padded_sequence is really slow

I am building a GRU-based architecture. Before, I was just padding the batches of sequences and passing them to the GRU. Obviously, that was introducing a small error in the results, because it's not quite the correct thing to do (the GRU doesn't know to stop when it reaches the padding elements).

Thus I switched out the naive batch of 2D padded sequences for pack_padded_sequence, so that I'm not passing extraneous padding items to the GRU. The training time increased by at least 3x. I am calling pack_padded_sequence on the GPU, so I need to check whether it's simply inefficient to do there. When using pack_padded_sequence with word embeddings, is it preferable to embed first and then pack, or pack first and then embed? I have been doing the former (roughly as sketched below) and am not sure whether it's contributing to the slowdown.
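Here is roughly what I'm doing now, embed first and then pack. The sizes, names, and lengths are just placeholders, not my actual model:

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence

# placeholder sizes; in the real training code everything lives on the GPU
embed = nn.Embedding(num_embeddings=1000, embedding_dim=16, padding_idx=0)
gru = nn.GRU(input_size=16, hidden_size=32, batch_first=True)

token_ids = torch.randint(1, 1000, (4, 10))   # padded batch of token ids
lengths = torch.tensor([10, 7, 5, 3])         # pre-computed lengths, sorted descending

embedded = embed(token_ids)                                          # embed first ...
packed = pack_padded_sequence(embedded, lengths, batch_first=True)   # ... then pack
packed_output, h_n = gru(packed)
```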

Any suggestions would be appreciated!

Usually, packing should be fast enough, but here are a few considerations:

  • Re “it’s not quite the 100% correct thing to do”: given that the output at the last real time step and the final state are the same, you could just read off the (input_len - 1)-th output of each sequence instead of packing (see the sketch after this list).
  • Packing should not be that slow by itself, but I would only expect a speedup when using cuDNN. One thing to keep in mind is that you want pre-computed lengths and should absolutely not compute them on the fly from the padding.
  • I’d probably pack first and then embed for efficiency reasons, but you may need to apply the embedding manually (there are many threads on “elementwise” operations on packed sequences).
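
To illustrate the first point, a minimal sketch of reading off the last real output using pre-computed lengths (made-up sizes, not your actual model):

```python
import torch
import torch.nn as nn

gru = nn.GRU(input_size=16, hidden_size=32, batch_first=True)
x = torch.randn(4, 10, 16)             # padded batch: (batch, max_len, features)
lengths = torch.tensor([10, 7, 5, 3])  # pre-computed sequence lengths

output, h_n = gru(x)                   # output: (batch, max_len, hidden)

# read off the (length - 1)-th output of each sequence; for a unidirectional GRU
# this matches the final state a packed run would give you (state after the last
# real token), whereas h_n here has already seen the padding.
idx = (lengths - 1).view(-1, 1, 1).expand(-1, 1, output.size(2))
last_output = output.gather(1, idx).squeeze(1)   # (batch, hidden)
```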

Best regards

Thomas

Thanks Thomas! How would you handle the bidirectional RNN case? I realized I could take input_len - 1 without issue for the forward pass, but the padding in the other direction would most definitely corrupt the results (the backward RNN would see a bunch of padding as its first few inputs). I tried building a separate dedicated backward RNN to complement the forward one, but it was inefficient and awkward.

I will check whether cuDNN is properly installed on the machine.

I am familiar with how to apply the embedding on a packed sequence. It's a little hacky, but you essentially rebuild the PackedSequence after passing its .data attribute through the embedding, roughly as in the sketch below.
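
Concretely, something along these lines (the embedding, sizes, and lengths are just placeholders):

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, PackedSequence

embed = nn.Embedding(num_embeddings=1000, embedding_dim=16, padding_idx=0)

token_ids = torch.randint(1, 1000, (4, 10))   # padded LongTensor: (batch, max_len)
lengths = torch.tensor([10, 7, 5, 3])         # pre-computed lengths, sorted descending

# pack the raw token ids first ...
packed = pack_padded_sequence(token_ids, lengths, batch_first=True)

# ... then apply the embedding to the flat .data tensor and rebuild the PackedSequence
packed_embedded = PackedSequence(embed(packed.data), packed.batch_sizes)
```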

As far as I remember, the bidirectional case is tricky without pack_padded_sequence, unfortunately. I could imagine using the JIT plus a mask, but one would have to see exactly how to get the JIT to generate good code for it, and it would depend on the PyTorch version.
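
For reference, a minimal sketch of the packed bidirectional case (made-up sizes, already-embedded inputs); with a packed input, neither direction ever sees the padding, so the final hidden states of both directions can be read off h_n directly:

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence

gru = nn.GRU(input_size=16, hidden_size=32, batch_first=True, bidirectional=True)

x = torch.randn(4, 10, 16)             # padded batch of already-embedded inputs
lengths = torch.tensor([10, 7, 5, 3])  # pre-computed lengths, sorted descending

packed = pack_padded_sequence(x, lengths, batch_first=True)
packed_output, h_n = gru(packed)

# h_n: (num_layers * num_directions, batch, hidden_size); the backward direction
# starts at each sequence's true last token, so no padding enters either state.
h_forward, h_backward = h_n[0], h_n[1]
```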

Best regards

Thomas