Using PackedSequence with LSTMCell

I’d like to pass a variable-length batch to a module that contains LSTMCells.

I can see that one can directly pass a PackedSequence into the LSTM layer, but there is no such functionality in LSTMCell.

Inspecting the output of the PackedSequence object, I can see how the batch_sizes variable would be used: I’d iterate over my cube, feeding slices to my LSTMCell as I usually would, passing the same input tensor but setting the batch size at each iteration. Rows outside the current batch size would then be copied forward so that the final results stay aligned.

However, I don’t understand what the shape of the data attribute of the PackedSequence is doing, or what purpose it serves. It seems to serialize the data, and I’m not sure why that is desirable, especially since what we really want is parallelized data.

I’m trying to avoid rewriting code that already exists or is likely to be written in the near future, so any guidance on this would be appreciated.


If your batch looks like this:

a b c d
e f 0 0
g 0 0 0

then passing it to pack_padded_sequence along with the lengths [4, 2, 1] will give you a PackedSequence object containing a e g b f c d and batch sizes [3, 2, 1, 1]. This is the format that cuDNN expects. What you’re probably looking for is (something like) the code in the AutogradRNN class, which implements an LSTM that takes a PackedSequence but uses LSTMCell rather than calling cuDNN.
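That packing layout can be sketched in plain Python (a toy illustration of the ordering only, not the real pack_padded_sequence, which works on tensors and does additional checks):

```python
def pack(batch, lengths):
    """Toy illustration of pack_padded_sequence's layout.
    Rows must be sorted by decreasing length; the output data is
    timestep-major: all step-0 entries, then all step-1 entries, etc."""
    data, batch_sizes = [], []
    for t in range(max(lengths)):
        # number of sequences still running at timestep t
        alive = sum(1 for n in lengths if n > t)
        batch_sizes.append(alive)
        data.extend(batch[b][t] for b in range(alive))
    return data, batch_sizes

batch = [['a', 'b', 'c', 'd'],
         ['e', 'f', 0, 0],
         ['g', 0, 0, 0]]
print(pack(batch, [4, 2, 1]))
# (['a', 'e', 'g', 'b', 'f', 'c', 'd'], [3, 2, 1, 1])
```

This is exactly the flattened order and batch-size list described above.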


I see, thanks, I didn’t know that cuDNN worked with that format.

I’m not exactly sure I’m correctly reading between the lines: are LSTMCells inherently incapable of making use of cuDNN? Is what I’m trying to achieve simply a Bad Idea™?

Yeah, cuDNN is a library that (among other things) lets you perform the forward or backward pass of an entire multi-layer, multi-timestep LSTM/GRU/RNN with one function call, dramatically reducing overhead relative to calling each constituent mathematical operation separately. LSTMCell is the module that defines the set of mathematical operations that make up an LSTM, so using a series of LSTMCells (as done in AutogradRNN) will produce the same result as CudnnRNN but be much slower.

One thing you can do is use cuDNN (by calling nn.LSTM rather than nn.LSTMCell) repeatedly with inputs of one timestep, but that will be only slightly faster at best than using LSTMCell.
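That one-timestep-at-a-time use of nn.LSTM looks something like this (a sketch; the sizes and the fake 5-step "stream" are placeholders):

```python
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=8, hidden_size=16)
state = None  # (h, c); None means zero initial states

# Pretend each `chunk` arrives from a stream as a single timestep
# of shape (seq_len=1, batch=4, input_size=8).
for chunk in torch.randn(5, 1, 4, 8):
    out, state = lstm(chunk, state)
    # cache `state` and pass it back in on the next call

print(out.shape)  # torch.Size([1, 4, 16])
```

Each call is a full cuDNN invocation for a length-1 sequence, which is why the overhead savings are minimal compared with longer sequences.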

Most of the time you should just use nn.LSTM and PyTorch will take care of things. If you want to do something unusual inside your LSTM, you won’t be able to use cuDNN anyway, so you should base your code on the code in AutogradRNN.
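For reference, the AutogradRNN-style loop over a packed batch can be sketched as follows. This is a minimal illustration, not the actual PyTorch internals, and the function name lstm_cell_packed is made up:

```python
import torch
import torch.nn as nn

def lstm_cell_packed(cell, data, batch_sizes, h0, c0):
    """Run an LSTMCell over packed data (timestep-major, as in
    PackedSequence.data). State rows beyond the current batch size are
    carried forward unchanged, keeping (h, c) aligned with the batch."""
    h, c = h0, c0
    outputs, offset = [], 0
    for bs in batch_sizes:
        step = data[offset:offset + bs]
        h_step, c_step = cell(step, (h[:bs], c[:bs]))
        h = torch.cat([h_step, h[bs:]])
        c = torch.cat([c_step, c[bs:]])
        outputs.append(h_step)
        offset += bs
    return torch.cat(outputs), (h, c)

cell = nn.LSTMCell(input_size=2, hidden_size=3)
data = torch.randn(7, 2)        # packed a-e-g-b-f-c-d layout from above
batch_sizes = [3, 2, 1, 1]
h0 = c0 = torch.zeros(3, 3)
out, (h, c) = lstm_cell_packed(cell, data, batch_sizes, h0, c0)
print(out.shape, h.shape)  # torch.Size([7, 3]) torch.Size([3, 3])
```

The output is itself in packed form, so it can be wrapped back into a PackedSequence with the same batch_sizes.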


Thanks for the run-down James.

I’m actually developing a solution that runs over streaming data, so originally I had designed my module with LSTMCells. But as you mention, I can see now that the advantages of LSTM layers far outweigh the drawbacks.

Also, since my real time stream is much slower than my processing power, there’s really no harm in simply passing data columns one by one to the LSTM and caching the final hidden/cell states for my next call. This is a small price to pay for an order of magnitude better performance on the training half.

Perhaps LSTMCell’s drawback (namely, the lack of cuDNN support) should be mentioned in the docs, obvious as it may be to someone familiar with the underlying libraries.

Thanks for your time. I’m just lovin’ this library, btw.

Having just started looking around PyTorch to implement some translation experiments, I got stuck at the point where cuDNN expects length-ordered batches. Currently I can’t imagine how to proceed if I were to train a multi-source, multi-target translation model, where you end up with N source batches of one length-ordering and M target label batches of a completely different length-ordering.

Let’s say I ordered each source batch, stored its permutation indices, and passed it along an RNN. Now let’s say I did the same thing independently for the decoder RNN’s ground-truth embeddings (inputs). How do we reconcile the different orderings between sources and targets? And what do we do in the case of attention, where the source RNN’s outputs need to interact with the decoder RNN’s hidden states?

Would it be suboptimal to convert the packed RNN outputs back to their padded counterparts once the encoders are done with their job? I’m pretty confused.