GPU memory consumption increases for LSTM models if not sorting samples by lengths

Hi, I’m training an LSTM model on variable-length samples. One thing I observe is that if I sort all the training data by sample length before building the data loader, I can afford bigger batch sizes than without sorting.

Since the training data stays the same and the max length of a padded batch also stays the same, I’m wondering whether pack_padded_sequence() in the unsorted case might actually cost more GPU memory.

Say we have a padded batch after sorting by length, where s*_t* means sample*, timestep*:

[s0_t0, 0, 0, 0]
[s1_t0, s1_t1, 0, 0]
[s2_t0, s2_t1, s2_t2, 0]
[s3_t0, s3_t1, s3_t2, s3_t3]

In the unsorted case the padding can look arbitrary:

[s0_t0, s0_t1, s0_t2, 0]
[s1_t0, 0, 0, 0]
[s2_t0, s2_t1, 0, 0]
[s3_t0, 0, 0, 0]

My hypothesis is that in the unsorted case we tend to pad more, so we may use more GPU memory.
Is that the case here? I guess it also depends on how pack_padded_sequence() is implemented in CUDA.
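For what it’s worth, here’s a toy pure-Python sketch of what I understand packing to do (my assumption about the semantics, not PyTorch’s actual CUDA implementation): packing keeps only the valid timesteps, grouped by timestep with sequences ordered longest-first, so the packed data size is just the sum of the lengths either way — the padding itself is dropped.

```python
def pack(padded, lengths):
    # Toy packing: walk timestep by timestep, keeping only entries that are
    # still within each sequence's length, longest sequences first.
    order = sorted(range(len(lengths)), key=lambda i: -lengths[i])
    data, batch_sizes = [], []
    for t in range(max(lengths)):
        step = [padded[i][t] for i in order if lengths[i] > t]
        data.extend(step)
        batch_sizes.append(len(step))
    return data, batch_sizes

# The sorted batch from above (longest first here, as packing expects).
sorted_batch = [["s3_t0", "s3_t1", "s3_t2", "s3_t3"],
                ["s2_t0", "s2_t1", "s2_t2", 0],
                ["s1_t0", "s1_t1", 0, 0],
                ["s0_t0", 0, 0, 0]]
data, bs = pack(sorted_batch, [4, 3, 2, 1])
print(len(data), bs)  # 10 [4, 3, 2, 1] -- packed size == sum of lengths

# The unsorted batch from above: packed size is still just the sum of lengths.
unsorted_batch = [["s0_t0", "s0_t1", "s0_t2", 0],
                  ["s1_t0", 0, 0, 0],
                  ["s2_t0", "s2_t1", 0, 0],
                  ["s3_t0", 0, 0, 0]]
data2, bs2 = pack(unsorted_batch, [3, 1, 2, 1])
print(len(data2))  # 7 -- padding never makes it into the packed data
```

So if this matches the real semantics, the packed tensor itself shouldn’t grow just because the batch is unsorted.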

Any other thoughts why GPU memory may increase?
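One way to sanity-check the padding hypothesis at the batch level (hypothetical lengths, pure Python): if each batch is padded only to its own longest sample rather than a global max, then sorting groups similar lengths into the same batch, so the total number of padded elements across all batches shrinks.

```python
import random

def padded_elements(lengths, batch_size):
    # Total elements across all padded batches, where each batch is padded
    # to the max length within that batch.
    total = 0
    for i in range(0, len(lengths), batch_size):
        batch = lengths[i:i + batch_size]
        total += len(batch) * max(batch)
    return total

random.seed(0)
lengths = [random.randint(1, 100) for _ in range(1000)]  # hypothetical dataset
batch_size = 32

unsorted_total = padded_elements(lengths, batch_size)
sorted_total = padded_elements(sorted(lengths), batch_size)
print(sorted_total, unsorted_total)  # sorting groups similar lengths -> less padding
```

Of course this only matters if batches really are padded per-batch; with padding to a fixed global max, the totals would be identical.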

bump, any thoughts guys?

Just want to update: it turns out it’s not because of any GPU memory increase in the LSTM layers, but because they’re followed by FC layers. Without sorting, the padded sequence length fed into the FC layers gets longer, so the FC input is bigger and GPU RAM consumption goes up.
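A quick back-of-the-envelope illustration of that update (hypothetical sizes, and assuming the FC layers see the LSTM output at the full padded length, shaped (batch, padded_len, hidden)): the FC input tensor, and hence its activation memory, scales linearly with the padded batch length.

```python
hidden = 256  # hypothetical LSTM hidden size
batch = 32    # hypothetical batch size

def fc_input_elems(padded_len):
    # Elements in the tensor handed to the FC layers when the LSTM output
    # is kept at the padded length: (batch, padded_len, hidden).
    return batch * padded_len * hidden

short = fc_input_elems(40)    # sorted batch whose longest sample is 40 steps
long_ = fc_input_elems(100)   # unsorted batch padded out to a 100-step outlier
print(short, long_)  # 327680 819200 -- 2.5x more activation memory
```

So an unsorted batch that happens to contain one long outlier pays for that length across the whole batch downstream of the LSTM.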