Shuffle batch retrieval but not values within batch

gursi261 · August 16, 2023, 3:50am

I am working on a machine translation dataset whose entries are sorted from the shortest sequence to the longest. In my collate_fn, I pad each batch to the length of its longest sequence.

Is there a way to build the batches without shuffling, so that entries of similar length end up in the same batch (since the data is sorted by length), while still shuffling the order in which the batches themselves are retrieved?

ptrblck · August 16, 2023, 2:47pm

Take a look at, e.g., this post from @vdw, where he shared a similar approach.
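
Independently of the linked post, here is a minimal sketch of one way this could be done: cut the sorted indices into consecutive chunks and shuffle only the order of the chunks. The class name `ShuffledContiguousBatchSampler` and the names `train_dataset` and `pad_collate` are made up for illustration. It assumes the dataset is already sorted by length and relies on `DataLoader` accepting any iterable of index lists as `batch_sampler`.

```python
import random


class ShuffledContiguousBatchSampler:
    """Yields lists of consecutive indices; only the order of the lists is shuffled."""

    def __init__(self, dataset_len, batch_size, seed=None):
        # Cut [0, dataset_len) into consecutive chunks; the last one may be shorter.
        self.batches = [
            list(range(start, min(start + batch_size, dataset_len)))
            for start in range(0, dataset_len, batch_size)
        ]
        self.rng = random.Random(seed)

    def __iter__(self):
        # Shuffle a copy of the batch order on every pass (i.e. every epoch),
        # leaving the indices inside each batch untouched.
        order = list(self.batches)
        self.rng.shuffle(order)
        return iter(order)

    def __len__(self):
        return len(self.batches)


# Hypothetical usage with PyTorch (train_dataset and pad_collate are placeholders):
# loader = torch.utils.data.DataLoader(
#     train_dataset,
#     batch_sampler=ShuffledContiguousBatchSampler(len(train_dataset), 32),
#     collate_fn=pad_collate,
# )
```

When `batch_sampler` is passed, `batch_size`, `shuffle`, `sampler`, and `drop_last` should be left at their defaults, since the sampler already decides what goes into each batch.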

gursi261 · August 16, 2023, 11:46pm

I see, thanks for your reply!

vdw · August 17, 2023, 12:52am

This post is more specifically for Seq2Seq models like machine translation.

gursi261 · August 17, 2023, 9:53am

Got it, thanks for the answer!