I was thinking about the mini-batch shape.
I guess it should be a matrix:
sequence_length * number_of_batches?
Is there any paper that explains mini batches simply and clearly?
I don’t know of any such paper. But one way of overcoming your doubts is to use a single example as input and add various print statements to see what shapes you want and what shapes to give as input.
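For instance, a minimal PyTorch sketch of that approach, with hypothetical toy dimensions, just to make the shapes visible:

```python
import torch
import torch.nn as nn

# Toy dimensions, just to inspect the shapes (all values hypothetical).
seq_len, batch_size, emb_dim, hidden_dim = 5, 3, 8, 16

# By default (batch_first=False), PyTorch RNNs expect (seq_len, batch, input_size).
x = torch.randn(seq_len, batch_size, emb_dim)
lstm = nn.LSTM(input_size=emb_dim, hidden_size=hidden_dim)

output, (h_n, c_n) = lstm(x)
print(x.shape)       # torch.Size([5, 3, 8])
print(output.shape)  # torch.Size([5, 3, 16]) -> (seq_len, batch, hidden_size)
print(h_n.shape)     # torch.Size([1, 3, 16]) -> (num_layers, batch, hidden_size)
```

With the default batch_first=False, the batch is laid out as (seq_len, batch_size, features), which also answers the shape question above.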
Strange, I dug deep and still haven’t found any interesting material, just as you mentioned.
I would also like to find some insights on variable sequence_length, since I found that this is also possible.
Thanks for the feedback!
As far as variable sequence_length is concerned, there are two ways to resolve this:

1. Pad all sequences in a batch to a common length.
2. Use PackedSequence (pack_padded_sequence) so the RNN skips the padded positions.

Combining both of the above techniques is the best practice for variable sequence_length; see the sketch below.
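A minimal PyTorch sketch of both techniques together (the toy tensors are hypothetical; note that pack_padded_sequence expects lengths sorted in descending order by default):

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence, pad_packed_sequence

# Three variable-length sequences of 8-dim features (hypothetical toy data),
# already sorted by length, longest first.
seqs = [torch.randn(7, 8), torch.randn(5, 8), torch.randn(3, 8)]
lengths = [7, 5, 3]

# 1. Padding: pad to the longest sequence -> (max_len, batch, features).
padded = pad_sequence(seqs)            # shape: (7, 3, 8)

# 2. PackedSequence: give the RNN the true lengths so it skips padded steps.
packed = pack_padded_sequence(padded, lengths)

lstm = nn.LSTM(input_size=8, hidden_size=16)
packed_out, _ = lstm(packed)

# Unpack back to a padded tensor if downstream code needs one.
out, out_lengths = pad_packed_sequence(packed_out)
print(out.shape)   # torch.Size([7, 3, 16])
```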
Hi. There is a technique called variable sequence length. It assumes no padding. This is what I meant, to generally understand the idea of mini batches.
But from a performance point of view, I think it is not that good. Also, fastai uses the technique I mentioned above, and they generally try to follow best practices.
Actually, I first saw the VSL technique in fastai. I think they are using it.
No worries, maybe someone will find a good link on mini batches or write a post on medium.com or similar. I find understanding mini-batches quite essential.
I am writing examples to understand them better, following your tips.
Looking forward to the Medium post.
For a text autoencoder network, I ended up writing my own data loader that not only sorts but also buckets sequences according to their exact length – in contrast to the BucketIterator of torchtext, which only buckets sequences of similar lengths.
This has a couple of advantages: since all sequences within a batch have exactly the same length, no padding is required, and there is no need to deal with PackedSequence at all.
For this autoencoder use case, bucketing sequences is the most pain-free solution. It’s just a one-time step during data preparation. While some batches might not be full – e.g., when the batch size is 64 but there are exactly 100 sequences of length 30, resulting in 2 batches of sizes 64 and 36 – if the dataset is large enough, >99% of batches are full. So there is no performance loss here.
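To illustrate the idea (this is just a sketch of the bucketing approach, not the actual data loader from the post), a minimal batch sampler that groups indices by exact sequence length, so every batch contains same-length sequences and no padding is needed:

```python
import random
from collections import defaultdict

def exact_length_batches(sequences, batch_size):
    """Group indices into batches where every sequence has exactly the
    same length, so no padding (and no PackedSequence) is needed."""
    buckets = defaultdict(list)
    for idx, seq in enumerate(sequences):
        buckets[len(seq)].append(idx)

    batches = []
    for indices in buckets.values():
        random.shuffle(indices)                        # shuffle within a bucket
        for i in range(0, len(indices), batch_size):
            batches.append(indices[i:i + batch_size])  # last batch may be short

    random.shuffle(batches)                            # shuffle batch order per epoch
    return batches
```

Usage: this yields lists of dataset indices, e.g. for 100 sequences of length 30 and batch_size=64 it produces one batch of 64 and one of 36 indices, matching the example above.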