Shuffle batch retrieval but not values within batch

I am working on a machine translation dataset whose entries are sorted from the shortest sequence to the longest. I pad them to the max length within each batch using a collate_fn.
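For reference, this is roughly the kind of padding collate_fn I mean (a minimal sketch; `PAD_IDX` and the `(src, tgt)` tuple layout are just placeholders for my actual setup):

```python
import torch
from torch.nn.utils.rnn import pad_sequence

PAD_IDX = 0  # placeholder padding index

def pad_collate(batch):
    # batch is a list of (src_tensor, tgt_tensor) pairs of varying length
    srcs, tgts = zip(*batch)
    # pad each side to the longest sequence in this batch only
    src_padded = pad_sequence(srcs, batch_first=True, padding_value=PAD_IDX)
    tgt_padded = pad_sequence(tgts, batch_first=True, padding_value=PAD_IDX)
    return src_padded, tgt_padded
```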

Is there a way to keep batch creation unshuffled, so that similar-length entries end up in the same batch (since the data is sorted by length), but still shuffle the order in which the batches themselves are retrieved?

Take a look at e.g. this post from @vdw where he shared a similar approach.
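The basic idea is a custom batch sampler: build contiguous index chunks (so the length-sorted grouping is preserved) and shuffle only the order of the chunks. A minimal sketch of that idea, not @vdw's exact code (`dataset`, the batch size, and `pad_collate` are placeholders):

```python
import random
from torch.utils.data import DataLoader, Sampler

class ShuffledContiguousBatchSampler(Sampler):
    """Yield contiguous index chunks so length-sorted data stays grouped,
    but shuffle the order in which the chunks themselves are served."""

    def __init__(self, dataset_len, batch_size):
        self.batches = [
            list(range(i, min(i + batch_size, dataset_len)))
            for i in range(0, dataset_len, batch_size)
        ]

    def __iter__(self):
        random.shuffle(self.batches)  # reshuffled every epoch
        return iter(self.batches)

    def __len__(self):
        return len(self.batches)

# Pass it as batch_sampler so DataLoader doesn't shuffle within batches.
loader = DataLoader(
    dataset,
    batch_sampler=ShuffledContiguousBatchSampler(len(dataset), 32),
    collate_fn=pad_collate,  # your padding collate_fn
)
```

Since each chunk covers neighbouring indices of the length-sorted dataset, every batch contains sequences of similar length, which keeps padding to a minimum while the batch order still varies between epochs.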

I see, thanks for your reply!

That post is aimed specifically at Seq2Seq models, e.g. for machine translation.


Got it, thanks for the answer!