GPU memory consumption increases for LSTM models if not sorting samples by lengths

Hi, I’m training an LSTM model on variable-length samples. One thing I observe is that if I sort all the training data by sample length before building the data loader, I can afford bigger batch sizes than without sorting.

Since the training data stays the same and the max length of a padded batch also stays the same, I’m wondering whether pack_padded_sequence() in the unsorted case might actually cost more GPU memory.

Say we have a padded batch after sorting by length, where s*_t* means sample*, timestep*:

[s0_t0, 0, 0, 0]
[s1_t0, s1_t1, 0, 0]
[s2_t0, s2_t1, s2_t2, 0]
[s3_t0, s3_t1, s3_t2, s3_t3]

In the unsorted case the padding can look arbitrary:

[s0_t0, s0_t1, s0_t2, 0]
[s1_t0, 0, 0, 0]
[s2_t0, s2_t1, 0, 0]
[s3_t0, 0, 0, 0]

My hypothesis is that in the unsorted case we tend to pad more, so we may use more GPU memory.
Is that the case here? I guess it also depends on how pack_padded_sequence() is implemented in CUDA.
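For what it’s worth, here’s a toy pure-Python sketch of what I understand packing to do (my assumption about the semantics, not PyTorch’s actual CUDA implementation): packing keeps only the valid timesteps, grouped by timestep with sequences ordered longest-first, so the packed data size is just the sum of the lengths either way — the padding itself is dropped.

```python
def pack(padded, lengths):
    # Toy packing: walk timestep by timestep, keeping only entries that are
    # still within each sequence's length, longest sequences first.
    order = sorted(range(len(lengths)), key=lambda i: -lengths[i])
    data, batch_sizes = [], []
    for t in range(max(lengths)):
        step = [padded[i][t] for i in order if lengths[i] > t]
        data.extend(step)
        batch_sizes.append(len(step))
    return data, batch_sizes

# The sorted batch from above (longest first here, as packing expects).
sorted_batch = [["s3_t0", "s3_t1", "s3_t2", "s3_t3"],
                ["s2_t0", "s2_t1", "s2_t2", 0],
                ["s1_t0", "s1_t1", 0, 0],
                ["s0_t0", 0, 0, 0]]
data, bs = pack(sorted_batch, [4, 3, 2, 1])
print(len(data), bs)  # 10 [4, 3, 2, 1] -- packed size == sum of lengths

# The unsorted batch from above: packed size is still just the sum of lengths.
unsorted_batch = [["s0_t0", "s0_t1", "s0_t2", 0],
                  ["s1_t0", 0, 0, 0],
                  ["s2_t0", "s2_t1", 0, 0],
                  ["s3_t0", 0, 0, 0]]
data2, bs2 = pack(unsorted_batch, [3, 1, 2, 1])
print(len(data2))  # 7 -- padding never makes it into the packed data
```

So if this matches the real semantics, the packed tensor itself shouldn’t grow just because the batch is unsorted.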

Any other thoughts why GPU memory may increase?
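One way to sanity-check the padding hypothesis at the batch level (hypothetical lengths, pure Python): if each batch is padded only to its own longest sample rather than a global max, then sorting groups similar lengths into the same batch, so the total number of padded elements across all batches shrinks.

```python
import random

def padded_elements(lengths, batch_size):
    # Total elements across all padded batches, where each batch is padded
    # to the max length within that batch.
    total = 0
    for i in range(0, len(lengths), batch_size):
        batch = lengths[i:i + batch_size]
        total += len(batch) * max(batch)
    return total

random.seed(0)
lengths = [random.randint(1, 100) for _ in range(1000)]  # hypothetical dataset
batch_size = 32

unsorted_total = padded_elements(lengths, batch_size)
sorted_total = padded_elements(sorted(lengths), batch_size)
print(sorted_total, unsorted_total)  # sorting groups similar lengths -> less padding
```

Of course this only matters if batches really are padded per-batch; with padding to a fixed global max, the totals would be identical.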

bump, any thoughts guys?

Just want to update: it turns out it’s not because of any GPU memory increase in the LSTM layers, but because they’re followed by FC layers. Without sorting, the padded sequence length fed into the FC layers gets longer, so the FC input is bigger and GPU RAM consumption goes up.
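A quick back-of-the-envelope illustration of that update (hypothetical sizes, and assuming the FC layers see the LSTM output at the full padded length, shaped (batch, padded_len, hidden)): the FC input tensor, and hence its activation memory, scales linearly with the padded batch length.

```python
hidden = 256  # hypothetical LSTM hidden size
batch = 32    # hypothetical batch size

def fc_input_elems(padded_len):
    # Elements in the tensor handed to the FC layers when the LSTM output
    # is kept at the padded length: (batch, padded_len, hidden).
    return batch * padded_len * hidden

short = fc_input_elems(40)    # sorted batch whose longest sample is 40 steps
long_ = fc_input_elems(100)   # unsorted batch padded out to a 100-step outlier
print(short, long_)  # 327680 819200 -- 2.5x more activation memory
```

So an unsorted batch that happens to contain one long outlier pays for that length across the whole batch downstream of the LSTM.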