Mini batch shape for nlp

I was thinking about the mini-batch shape.
I guess it should be a matrix:
sequence_length × batch_size?

Is there any paper that explains mini batches simple and clear?

I don’t know of any such paper. But one way of overcoming your doubts is to use a single example as input and use various print statements to see what shapes you want and what shapes to give as input.
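The tip above can be sketched like this: a minimal PyTorch snippet (with assumed toy dimensions) that pushes a single example through an embedding and an LSTM and prints the shape at each step.

```python
# A minimal sketch of shape-probing with prints; all sizes are toy values.
import torch
import torch.nn as nn

vocab_size, emb_dim, hidden_dim = 100, 8, 16
seq_len, batch_size = 5, 1  # start with a single example

embedding = nn.Embedding(vocab_size, emb_dim)
lstm = nn.LSTM(emb_dim, hidden_dim)  # expects (seq_len, batch, emb_dim) by default

tokens = torch.randint(0, vocab_size, (seq_len, batch_size))
print(tokens.shape)    # torch.Size([5, 1])
embedded = embedding(tokens)
print(embedded.shape)  # torch.Size([5, 1, 8])
output, (h, c) = lstm(embedded)
print(output.shape)    # torch.Size([5, 1, 16])
print(h.shape)         # torch.Size([1, 1, 16])
```

Once the shapes make sense for batch_size = 1, increasing the batch size only changes that one dimension.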

Strange, I dug deep and still haven’t found interesting material, just as you mentioned.
I would also like to find some insights on variable sequence_length, since this seems to be possible as well.

Thanks for the feedback!

As far as variable sequence length is concerned, there are two ways to resolve this:

  1. Pad the sequences with a padding token (e.g. `<pad>`) at the end. This can lead to some inefficiency: if one text is very long, a lot of padding tokens are required for the shorter ones.
  2. Sort the sequences by their length.

Combining both of the above techniques is the best practice for variable sequence lengths.
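The combination of the two techniques can be sketched in plain Python: sort by length first, then pad only within each batch, so sequences of similar length end up together and little padding is wasted. `PAD = 0` and the function name are assumptions for illustration.

```python
# Hypothetical sketch: sort sequences by length, slice into batches,
# then pad each batch only up to its own maximum length.
PAD = 0  # assumed padding index

def make_batches(sequences, batch_size):
    ordered = sorted(sequences, key=len)
    batches = []
    for i in range(0, len(ordered), batch_size):
        batch = ordered[i:i + batch_size]
        max_len = max(len(seq) for seq in batch)
        # pad each sequence at the end, only up to this batch's max length
        batches.append([seq + [PAD] * (max_len - len(seq)) for seq in batch])
    return batches

seqs = [[1, 2, 3, 4, 5], [6, 7], [8, 9, 10], [11]]
for b in make_batches(seqs, batch_size=2):
    print(b)
# [[11, 0], [6, 7]]
# [[8, 9, 10, 0, 0], [1, 2, 3, 4, 5]]
```

Because the short sequences share a batch, only two padding tokens are needed in total instead of padding everything to length 5.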

Hi. There is a technique called variable sequence length (VSL). It assumes no padding. This is what I meant by wanting to generally understand the idea of mini batches.

But from a performance point of view, I think it is not that good. Also, fastai uses the technique I mentioned above, and they generally try to use best practices.

Actually, I first saw the VSL technique in fastai. I think they are using it.
No worries, maybe someone will find a good link on mini batches or write a post on Medium. I find understanding mini batches quite essential.
I am writing examples to understand them better, following your tips.

Looking forward to the Medium post!

For a text autoencoder network I ended up writing my own data loader that not only sorts but also buckets sequences by their exact length – in contrast to torchtext’s BucketIterator, which only groups sequences of similar lengths.

This has a couple of advantages:

  • No need for padding and the potential loss of efficiency
  • No need for PackedSequence
  • In the decoder I can generate the words in a loop for the whole batch, since they all have the same length. Existing examples for Seq2Seq models assume batch sizes of 1. That’s way too slow :slight_smile:

For this autoencoder use case, bucketing sequences is the most pain-free solution. It’s just a one-time step during data preparation. While some batches might not be full – e.g., when the batch size is 64 but there are exactly 100 sequences of length 30, resulting in 2 batches of sizes 64 and 36 – if the dataset is large enough, >99% of batches are full. So no performance loss here.
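The bucketing idea can be sketched in a few lines of plain Python (the function name is an assumption, not the actual loader): group sequences by exact length, then slice each bucket into batches, so every batch contains sequences of a single length and needs neither padding nor PackedSequence.

```python
# Sketch of exact-length bucketing: each batch holds sequences of one length.
from collections import defaultdict

def bucket_batches(sequences, batch_size):
    buckets = defaultdict(list)
    for seq in sequences:
        buckets[len(seq)].append(seq)  # group by exact length
    batches = []
    for bucket in buckets.values():
        for i in range(0, len(bucket), batch_size):
            # the last slice of a bucket may be a partial batch
            batches.append(bucket[i:i + batch_size])
    return batches

# The example from above: 100 sequences of length 30 with batch size 64
seqs = [[0] * 30 for _ in range(100)]
print([len(b) for b in bucket_batches(seqs, batch_size=64)])  # [64, 36]
```

This reproduces the 64 + 36 split mentioned above; with a large dataset, such partial batches become a negligible fraction of the total.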