I’d like to pass a variable-length batch to a module that has LSTMCells in it.
I can see that one can directly pass a PackedSequence into the LSTM layer, but there is no such functionality in
Inspecting the output of the PackedSequence object, I can understand the way the batch_sizes variable would be used: I’d iterate over my cube, feeding slices to my LSTMCell as I usually would, passing the same input tensor but setting the batch size at each iteration. Rows outside of the batch size would then be copied forward so that the final results would be aligned.
However, I don’t understand what the shape of the PackedSequence parameter is doing and what purpose it serves. It seems to be serializing the data, and I’m not sure I understand why this is desirable, especially since what we really want is parallelized data.
I’m trying to avoid rewriting code that already exists or is likely to be written in the near future, so any guidance on this would be appreciated.
If your batch looks like this:
a b c d
e f 0 0
g 0 0 0
then passing it to pack_padded_sequence along with the lengths [4, 2, 1] will give you a PackedSequence object containing a e g b f c d and batch sizes [3, 2, 1, 1]. This is the format that cuDNN expects. What you’re probably looking for is (something like) the code in the AutogradRNN class (https://github.com/pytorch/pytorch/tree/master/torch/nn/_functions/rnn.py), which implements an LSTM that takes a PackedSequence but uses LSTMCell rather than calling cuDNN.
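To make the layout concrete, here is a small sketch that reproduces the example above with integers standing in for the letters (a..g become 1..7, and 0 is padding). The integer encoding is only an assumption made for illustration.

```python
import torch
from torch.nn.utils.rnn import pack_padded_sequence

# Integers stand in for the letters: a..g -> 1..7, with 0 as padding.
batch = torch.tensor([
    [1, 2, 3, 4],   # a b c d
    [5, 6, 0, 0],   # e f 0 0
    [7, 0, 0, 0],   # g 0 0 0
])
lengths = [4, 2, 1]  # sequences are already ordered longest first

packed = pack_padded_sequence(batch, lengths, batch_first=True)

# The data is stored timestep by timestep, skipping the padding.
data = packed.data.tolist()
# batch_sizes[t] is the number of sequences still running at timestep t.
batch_sizes = packed.batch_sizes.tolist()

print(data)         # [1, 5, 7, 2, 6, 3, 4]
print(batch_sizes)  # [3, 2, 1, 1]
```

Reading the data in chunks of 3, 2, 1 and 1 elements gives back one timestep at a time, which is why the flat layout still lets every timestep be processed as a batch.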
I see, thanks, I didn’t know that cuDNN worked with that format.
I’m not exactly sure I’m correctly reading between the lines: are LSTMCells inherently incapable of making use of cuDNN? Is what I’m trying to achieve simply a Bad Idea ™?
Yeah, cuDNN is a library that (among other things) lets you perform the forward or backward pass of an entire multi-layer, multi-timestep LSTM/GRU/RNN with one function call, dramatically reducing overhead relative to calling each constituent mathematical operation separately.
LSTMCell is the module that defines the set of mathematical operations that makes up an LSTM, so using a series of LSTMCells (as done in AutogradRNN) will have the same result as CudnnRNN but be much slower.
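As a rough illustration of that equivalence, the sketch below steps an nn.LSTMCell over a PackedSequence using batch_sizes and compares the result with nn.LSTM given the same weights. This is not the AutogradRNN code; the helper name run_cells_on_packed, the sizes and the weight copying are all assumptions made for the example.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence

torch.manual_seed(0)
input_size, hidden_size = 5, 8
lstm = nn.LSTM(input_size, hidden_size, batch_first=True)
cell = nn.LSTMCell(input_size, hidden_size)

# Give the cell the same parameters as the single-layer LSTM.
with torch.no_grad():
    cell.weight_ih.copy_(lstm.weight_ih_l0)
    cell.weight_hh.copy_(lstm.weight_hh_l0)
    cell.bias_ih.copy_(lstm.bias_ih_l0)
    cell.bias_hh.copy_(lstm.bias_hh_l0)


def run_cells_on_packed(cell, packed):
    """Hypothetical helper: step an LSTMCell over a PackedSequence."""
    max_batch = int(packed.batch_sizes[0])
    h = packed.data.new_zeros(max_batch, cell.hidden_size)
    c = packed.data.new_zeros(max_batch, cell.hidden_size)
    outputs, offset = [], 0
    for bs in packed.batch_sizes.tolist():
        # Rows of the flat data that belong to this timestep.
        x_t = packed.data[offset:offset + bs]
        offset += bs
        h_t, c_t = cell(x_t, (h[:bs], c[:bs]))
        # Sequences that have already ended (rows >= bs) keep their last state.
        h = torch.cat([h_t, h[bs:]], dim=0)
        c = torch.cat([c_t, c[bs:]], dim=0)
        outputs.append(h_t)
    # Outputs are returned in the same flat, packed layout as the input.
    return torch.cat(outputs, dim=0), (h, c)


padded = torch.randn(3, 4, input_size)
packed = pack_padded_sequence(padded, [4, 2, 1], batch_first=True)

cell_out, (cell_h, cell_c) = run_cells_on_packed(cell, packed)
lstm_out, (lstm_h, lstm_c) = lstm(packed)

print(torch.allclose(cell_out, lstm_out.data, atol=1e-5))
print(torch.allclose(cell_h, lstm_h[0], atol=1e-5))
```

The Python loop launches one cell call per timestep, which is the per-operation overhead that the fused cuDNN path avoids.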
One thing you can do is use cuDNN (by calling nn.LSTM rather than nn.LSTMCell) repeatedly with inputs of one timestep, but that will be only slightly faster at best than using
Most of the time you should just use nn.LSTM and PyTorch will take care of things. If you want to do something unusual inside your LSTM, you won’t be able to use cuDNN anyway, so you should base your code on the code in
Thanks for the run-down James.
I’m actually developing a solution that runs through streaming data, and so originally I had designed my module with LSTMCells. But as you mention, I can see now that the advantages of LSTM layers far outweigh the drawbacks.
Also, since my real-time stream is much slower than my processing power, there’s really no harm in simply passing data columns one by one to the LSTM and caching the final hidden/cell states for my next call. This is a small price to pay for an order of magnitude better performance on the training half.
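A minimal sketch of that streaming pattern, assuming a single stream and made-up sizes: each call feeds one timestep to nn.LSTM and carries the returned (h, c) tuple into the next call. The comparison against a single full-sequence call is only there to show that the two give the same outputs.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
lstm = nn.LSTM(input_size=3, hidden_size=6, num_layers=2, batch_first=True)
stream = torch.randn(1, 10, 3)  # one stream, 10 timesteps, 3 features

with torch.no_grad():
    # Training-style call: the whole sequence at once.
    full_out, _ = lstm(stream)

    # Streaming-style calls: one timestep each, with the state cached between calls.
    state = None  # None means "start from zero hidden and cell states"
    step_outputs = []
    for t in range(stream.size(1)):
        x_t = stream[:, t:t + 1, :]      # shape (1, 1, 3)
        out_t, state = lstm(x_t, state)  # state is the (h, c) tuple to reuse
        step_outputs.append(out_t)
    streamed_out = torch.cat(step_outputs, dim=1)

print(torch.allclose(full_out, streamed_out, atol=1e-5))
```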
LSTMCell's drawback (namely lack of cuDNN support) should be mentioned in the docs, obvious as it may be to someone familiar with the libraries.
Thanks for your time. I’m just lovin’ this library, btw.
Having just started looking into PyTorch to implement some translation experiments, I got stuck at this point, where cuDNN expects length-ordered batches. Currently I cannot imagine how to proceed if I were to train a multi-source, multi-target translation model, where you’ll end up with N source batches with different length orderings and M target label batches with completely different length orderings.
Let’s say I ordered each source batch, stored their permutation indices and passed them through an RNN. Now let’s say I did the same thing independently for the decoder RNN’s ground-truth embeddings (inputs). How do we take care of the different orderings between sources and targets? Plus, what do we do in the case of attention, where the source RNN outputs need to interact with the decoder RNN’s hidden states?
Would it be suboptimal to convert packed RNN outputs to padded counterparts once the encoders are done with their job? I’m pretty confused.
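For what it’s worth, the sort / pack / unpad / unsort round trip described above can be sketched as follows. This is one possible approach under assumed sizes, not a statement about whether it is optimal: the encoder outputs come back padded and in the original batch order, so they line up with targets that were never reordered. Newer PyTorch versions also accept enforce_sorted=False in pack_padded_sequence, which does this bookkeeping internally.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

torch.manual_seed(0)
encoder = nn.LSTM(input_size=4, hidden_size=6, batch_first=True)

src = torch.randn(3, 5, 4)              # padded source batch, original order
src_lengths = torch.tensor([2, 5, 3])   # not length-ordered

# Sort by decreasing length and remember how to undo the permutation.
sorted_lengths, sort_idx = src_lengths.sort(descending=True)
unsort_idx = sort_idx.argsort()

packed = pack_padded_sequence(src[sort_idx], sorted_lengths, batch_first=True)
packed_out, (h_n, c_n) = encoder(packed)

# Back to a padded tensor; positions past each length are filled with zeros.
padded_out, out_lengths = pad_packed_sequence(packed_out, batch_first=True)

# Restore the original batch order so rows line up with the targets again.
enc_out = padded_out[unsort_idx]   # (batch, max_len, hidden)
enc_h = h_n[:, unsort_idx]         # (num_layers, batch, hidden)

print(enc_out.shape, enc_h.shape)
```

With enc_out padded and in the original order, an attention layer can mask positions beyond src_lengths instead of working with the packed layout.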