Tensorflow-esque bucket by sequence length

Good question!

Essentially, it’s just for convenience; the model is agnostic to the sequence lengths. At least that holds for training, since the lengths of the output sequences are known. But you’re right, prediction only works one sequence at a time and no longer in batches, since the output sequences are likely to differ in length.

I actually implemented bucketing for Seq2Seq and use it all the time…again, simply for convenience and performance.
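For reference, here is a minimal sketch of the bucketing idea, assuming a plain-Python setup where `lengths[i]` gives the length of training example `i` (the names `bucket_batches`, `lengths`, and `batch_size` are illustrative, not taken from the actual implementation):

```python
import random

def bucket_batches(lengths, batch_size, shuffle=True):
    """Group example indices into batches of similar sequence length.

    lengths    -- list where lengths[i] is the length of example i
    batch_size -- number of examples per batch
    """
    # Sort indices by length so each batch holds similarly sized sequences,
    # which keeps padding (and wasted computation) to a minimum.
    indices = sorted(range(len(lengths)), key=lambda i: lengths[i])
    batches = [indices[i:i + batch_size]
               for i in range(0, len(indices), batch_size)]
    if shuffle:
        random.shuffle(batches)  # shuffle the batch order, not the contents
    return batches

# Example: 10 sequences of various lengths, batched 3 at a time
lengths = [5, 12, 7, 3, 9, 12, 4, 8, 6, 11]
for batch in bucket_batches(lengths, batch_size=3):
    print(batch, [lengths[i] for i in batch])
```

Each batch then only needs to be padded to the longest sequence within that batch rather than the longest in the whole dataset, which is where the convenience and the performance gain come from.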