Packed vs unpacked RNN sequence format for cuDNN

I am exploring the different ways that RNNs can be implemented for batches of variable-length sequences. I found that PyTorch uses a packed sequence format (torch.nn.utils.rnn.PackedSequence) for variable-length batches.
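For concreteness, here is a minimal PyTorch sketch contrasting the two layouts for a toy batch (the tensor sizes are made up for illustration):

import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

# Three sequences of lengths 5, 3, and 2, padded out to length 5.
lengths = torch.tensor([5, 3, 2])
padded = torch.randn(3, 5, 8)  # (batch, max_seq_len, features)

rnn = nn.LSTM(input_size=8, hidden_size=16, batch_first=True)

# Packed path: only the 5 + 3 + 2 = 10 real timesteps are stored and processed.
packed = pack_padded_sequence(padded, lengths, batch_first=True, enforce_sorted=True)
packed_out, _ = rnn(packed)
out, out_lengths = pad_packed_sequence(packed_out, batch_first=True)

# Padded path: all 3 * 5 = 15 timesteps are processed, padding included.
padded_out, _ = rnn(padded)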

Looking at the documentation for cuDNN 8, I noticed that the packed format appears to be supported only for backwards compatibility:

This function initializes a previously created RNN data descriptor object. This data structure is intended to support the unpacked (padded) layout for input and output of extended RNN inference and training functions. A packed (unpadded) layout is also supported for backward compatibility.

(API Reference - NVIDIA Docs)

This seems to suggest that cuDNN prefers unpacked, padded sequences instead of packed sequences. Why is this the case? Aren’t packed sequences better, since they consume less memory?
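The memory savings themselves are easy to quantify. A rough sketch (batch size, lengths, and feature size are hypothetical):

import torch

# Hypothetical batch: 64 sequences with lengths between 10 and 100,
# feature size 256, float32 (4 bytes per element).
torch.manual_seed(0)
lengths = torch.randint(10, 101, (64,))
features, bytes_per_elem = 256, 4

padded_elems = 64 * int(lengths.max()) * features   # every sequence padded to the max length
packed_elems = int(lengths.sum()) * features        # only real timesteps stored

print(f"padded: {padded_elems * bytes_per_elem / 1e6:.1f} MB")
print(f"packed: {packed_elems * bytes_per_elem / 1e6:.1f} MB")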

The overhead from the extra indirection may well outweigh the relatively minor memory savings. It is best to measure both formats with the tensor shapes you actually use in practice; a rough benchmark sketch follows.
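A minimal way to do that measurement in PyTorch, assuming a CUDA-capable build (all shapes are illustrative, not representative of any particular workload):

import time
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence

device = "cuda"
torch.manual_seed(0)

# Illustrative batch: 64 sequences, lengths sorted descending as required by
# enforce_sorted=True, feature size 256.
lengths = torch.sort(torch.randint(10, 101, (64,)), descending=True).values
padded = torch.randn(64, int(lengths.max()), 256, device=device)
rnn = nn.LSTM(input_size=256, hidden_size=512, batch_first=True).to(device)

def bench(fn, iters=50):
    # Warm up, then time with proper GPU synchronization.
    for _ in range(5):
        fn()
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters

packed = pack_padded_sequence(padded, lengths, batch_first=True, enforce_sorted=True)
t_padded = bench(lambda: rnn(padded))
t_packed = bench(lambda: rnn(packed))
print(f"padded: {t_padded * 1e3:.2f} ms, packed: {t_packed * 1e3:.2f} ms")

Which format wins depends heavily on how skewed the length distribution is, so it is worth running this with your own data rather than relying on the toy numbers above.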