RNN implementation in PyTorch vs TensorFlow

I'm getting started in PyTorch and have a few years' experience with TensorFlow v1. I'm a bit confused about how RNNs work in PyTorch.

It seems to me that the provided RNNs in 'nn' are all C implementations and I can't seem to find an equivalent to TensorFlow's 'scan' or 'dynamic_rnn' function. Furthermore, all custom implementations of RNNs in PyTorch seem to work using Python for-loops. Wouldn't this result in multiple calls to the GPU, which slows everything down?

Second: I am used to dealing with minibatches containing variable length sequences in Tensorflow by providing the length of each sequence to the RNN function. This has the advantage (over using an explicit pad token) that you do not need to create a useless entry in the vocabulary just for a pad token. The padded data can be any existing token which will be ignored because it lies outside of the declared length. But it seems that PyTorch assumes a pad token the same way that Keras does. Am I understanding this right? Do I have to reserve an entry in my embedding matrix for pad tokens?


I can't comment on the first issue since I don't know what scan and dynamic_rnn are doing. I don't see a problem with for loops, though. I would assume that nn.LSTM and nn.GRU also loop internally over the sequence, but that's just my assumption.

Regarding sequences with variable lengths, you can do the same in PyTorch. Search for PackedSequence and pack_padded_sequence. There are also tools like the BucketIterator that create batches with sequences of the same or almost the same length.
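To illustrate, here is a minimal sketch of pack_padded_sequence on a made-up padded batch (the tensor shapes and lengths are just example values):

```python
import torch
from torch.nn.utils.rnn import pack_padded_sequence

# hypothetical batch: 3 sequences padded to length 4, feature size 5
x = torch.randn(3, 4, 5)
lengths = torch.tensor([4, 2, 1])  # sorted descending (the default requirement)

packed = pack_padded_sequence(x, lengths, batch_first=True)
# packed.data holds only the real (non-pad) timesteps: 4 + 2 + 1 = 7 rows
# packed.batch_sizes tells the RNN how many sequences are active per step
```

Passing the resulting PackedSequence to nn.LSTM or nn.GRU makes the RNN skip the padded positions entirely.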

I guess you can get around having a padding token in your embedding matrix: if you give the lengths to the RNN, any tokens beyond them get ignored anyway. On the other hand, I can't see any harm in having a dedicated padding token. The additional space and processing overhead is negligible.
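If you do reserve a pad entry, nn.Embedding has a padding_idx argument for exactly this: the embedding at that index stays at zero and receives no gradient updates. A small sketch (the vocabulary size, dimensions, and token ids are made up):

```python
import torch
import torch.nn as nn

# reserve index 0 for padding; its vector is zero and is never updated
emb = nn.Embedding(num_embeddings=10, embedding_dim=4, padding_idx=0)

tokens = torch.tensor([[3, 5, 0, 0]])  # one sequence padded with 0s
vecs = emb(tokens)                     # (1, 4, 4); rows for pad tokens are zero
```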

Thank you for your answer.

I was told that the JIT compiler of PyTorch makes the for-loop overhead negligible. Is this true?

Packed sequences are weird. You need to first pack your padded sequences before passing them to the RNN and then unpack the result in order to use it for something else, such as word tagging. Isn't that a large performance overhead, apart from the extra memory needed for storing the packed sequence?
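For reference, the pack-then-unpack round trip looks like this (a sketch with made-up sizes); pad_packed_sequence gives you back a regular padded tensor you can feed to a tagging head:

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

rnn = nn.LSTM(input_size=5, hidden_size=6, batch_first=True)

x = torch.randn(3, 4, 5)           # 3 sequences padded to length 4
lengths = torch.tensor([4, 2, 1])

packed = pack_padded_sequence(x, lengths, batch_first=True)
out_packed, _ = rnn(packed)
# unpack back to a padded tensor; positions past each length are zero-filled
out, out_lengths = pad_packed_sequence(out_packed, batch_first=True)
# out: (3, 4, 6), ready for e.g. a per-token classifier
```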

Also, it seems that the embedding function of PyTorch does not work the same way that Keras does, that is, it doesn’t include a mask together with the embedded words. Does this mean that the packed sequence approach is the only way to handle variable length sequences in PyTorch RNNs or is there more than one way? If that is the only way, then are there tools available to help you make a custom RNN that makes use of packed sequences?

The overhead of the for loop itself is already smaller than you think it is (at least in my experience it usually is the case that people overestimate the impact of the Python parts), but the JIT will use some unrolling to reduce it further. Two key optimizations done by the JIT are to fuse the pointwise computations in RNNs and to combine certain matrix multiplications into one.

Packed sequences are mostly a CuDNN thing, and I guess PyTorch got it from there.
For unidirectional RNNs, just keeping track of the lengths somewhere and letting the RNN run over the padding might be reasonably performant (that won’t work for bi-directional RNNs, though).
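The "just run over the padding" idea for unidirectional RNNs can look like this: run the whole padded batch through and then gather each sequence's output at its last valid step (sizes and lengths are example values):

```python
import torch
import torch.nn as nn

rnn = nn.GRU(input_size=4, hidden_size=8, batch_first=True)

x = torch.randn(3, 5, 4)        # 3 sequences padded to length 5
lengths = torch.tensor([5, 3, 2])

out, _ = rnn(x)                 # runs over the padding too
# gather the output at each sequence's last valid timestep
idx = (lengths - 1).view(-1, 1, 1).expand(-1, 1, out.size(-1))
last = out.gather(1, idx).squeeze(1)   # (3, 8)
```

The states computed past each length are garbage, but since a unidirectional RNN only propagates forward, the state at the last valid step is unaffected.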
I think making RNNs with masks work with good performance in the JIT would be quite doable, but I’m not immediately aware of examples.

Best regards



How would you make use of packed sequences in a custom RNN?

You could just slice the state and take the appropriate slice of data. This has the drawback that PyTorch cannot fuse the matmuls as nicely anymore, but it should not be unreasonable.
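A sketch of that state-slicing idea: because pack_padded_sequence sorts sequences longest-first, packed.batch_sizes is non-increasing, so at each timestep you can slice both the data and the hidden state down to the still-active sequences (the toy tanh cell and its sizes are made up for illustration):

```python
import torch
from torch.nn.utils.rnn import pack_padded_sequence

# hypothetical weights for a toy tanh-RNN cell
input_size, hidden_size = 5, 6
w_ih = torch.randn(hidden_size, input_size)
w_hh = torch.randn(hidden_size, hidden_size)

x = torch.randn(3, 4, input_size)   # 3 sequences padded to length 4
lengths = torch.tensor([4, 2, 1])   # sorted descending

packed = pack_padded_sequence(x, lengths, batch_first=True)
h = torch.zeros(int(packed.batch_sizes[0]), hidden_size)
outputs = []
offset = 0
for bs in packed.batch_sizes.tolist():
    step = packed.data[offset:offset + bs]   # inputs of active sequences
    h = torch.tanh(step @ w_ih.t() + h[:bs] @ w_hh.t())  # slice state to match
    outputs.append(h)
    offset += bs

packed_out = torch.cat(outputs)     # same layout as packed.data
```

Finished sequences simply drop off the end of the state at each step, so no masking is needed; the trade-off, as noted, is that the matmuls vary in size and fuse less nicely.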
But my understanding was that you preferred masking…

OK then what is the recommended approach to implement a custom bi-directional RNN with variable length sequences in a minibatch?