Does mini-batched LSTM have better performance?

We can implement LSTM training either in mini-batches or by feeding samples one by one.
Are there any performance differences between these two methods?

Not quite sure if I understand the question correctly. In terms of calculation time, batches are much faster than feeding the training data one by one. Technically, you want to maximize the batch size for the best performance, but memory size will stop you at some point.
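To make the speed point concrete, here is a minimal sketch with made-up sizes (not from the question): the batched forward pass over `nn.LSTM` computes the same outputs as the one-by-one loop, just in a few large matrix multiplications instead of many tiny ones.

```python
import torch
import torch.nn as nn

# made-up sizes: 64 sequences, all of length 20, feature size 10
lstm = nn.LSTM(input_size=10, hidden_size=32, batch_first=True)
data = torch.randn(64, 20, 10)

# feeding one by one: 64 separate forward passes, each of shape (1, 20, 10)
outputs = [lstm(seq.unsqueeze(0))[0] for seq in data]

# mini-batch: a single forward pass over the whole (64, 20, 10) tensor;
# same results up to floating-point noise, but far fewer kernel launches
# and much larger matrix multiplications, hence the speed-up
batched_outputs, _ = lstm(data)
```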

The larger the batch size, the more you average out the different gradients if they are (very) different for the items within one batch. So in general, you can often use higher learning rates for larger batches. You might also have a look at the paper “Don’t Decay the Learning Rate, Increase the Batch Size”.

Many Seq2Seq examples use batches of size 1 since the decoder predicts the next word step by step, and the number of steps might differ between the sequences in a batch, even if the input sequences themselves have the same lengths. So yeah, in such encoder-decoder setups, feeding the data one by one might be the only/best/most straightforward solution. But for a classification task there is, in my opinion, no reason not to go for batches.
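Just to illustrate why the batch size ends up being 1: here is a rough, hypothetical greedy decoding loop (all sizes and the SOS/EOS token ids are made up). Each sequence reaches its own EOS at a different step, so the loop naturally handles one sequence at a time.

```python
import torch
import torch.nn as nn

# made-up toy decoder: vocabulary of 100 tokens, hidden size 32
vocab_size, hidden_size = 100, 32
embed = nn.Embedding(vocab_size, hidden_size)
decoder = nn.GRU(hidden_size, hidden_size, batch_first=True)
out_proj = nn.Linear(hidden_size, vocab_size)

SOS, EOS, MAX_LEN = 1, 2, 20              # assumed special token ids and length cap
hidden = torch.zeros(1, 1, hidden_size)   # stands in for the encoder's final state
token = torch.tensor([[SOS]])             # a "batch" of a single sequence

generated = []
for _ in range(MAX_LEN):
    step_out, hidden = decoder(embed(token), hidden)   # one decoding step
    token = out_proj(step_out).argmax(dim=-1)          # greedy next word, shape (1, 1)
    if token.item() == EOS:   # every sequence stops at its own step,
        break                 # which is what makes batching awkward here
    generated.append(token.item())
```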

Hi Chris
Thanks for pointing out the paper! I will check it out.

It is more of a minor implementation issue: if I want to go with batches, I need to include padding to handle the varying sequence lengths.
So I want to know whether it is necessary to use batches with an LSTM, since calculation time is not my first concern.

Yeah, the whole padding thing always seems a bit questionable :). I actually have no idea how much it affects the effectiveness of an RNN layer, particularly when some sequences in the batch are really short, i.e., carry a lot of padding. From my experience so far, you have several options:

  • You use padded batches “as is” and hope the padding doesn’t cause any harm :).
  • You can use PackedSequence. In this case, you have to be a bit more careful how you use the return values from the GRU layer. See my post on this topic and the PackedSequence sketch after this list.
  • You can use the ‘BucketIterator’ from torchtext. This data loader sorts the sequences and creates batches with sequences of the same or at least similar length. For large datasets, most batches will have sequences of the same length; in all other batches the padding will be minimal.
  • You can write your own data loader that guarantees that all sequences in a batch have the same length. I did this for an autoencoder network (seq2seq) because of the word-by-word loop in the decoder; a rough sketch of the idea also follows after this list. Here, some batches might not be full, but given a large dataset this is negligible.
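For the PackedSequence option, here is a minimal sketch of the usual pattern with a GRU (all sizes made up): packing passes the true lengths along, so the RNN skips the padded time steps.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

# made-up batch: 3 zero-padded sequences with true lengths 7, 4, 2, feature size 5
batch = torch.randn(3, 7, 5)        # (batch, max_len, features)
lengths = torch.tensor([7, 4, 2])   # sorted descending, as enforce_sorted=True expects

gru = nn.GRU(input_size=5, hidden_size=8, batch_first=True)

packed = pack_padded_sequence(batch, lengths, batch_first=True)
packed_out, h_n = gru(packed)       # the GRU never sees the padded time steps

# back to a regular padded tensor of shape (batch, max_len, hidden)
out, out_lengths = pad_packed_sequence(packed_out, batch_first=True)

# h_n holds the hidden state at each sequence's last *real* time step
print(out.shape, h_n.shape)         # torch.Size([3, 7, 8]) torch.Size([1, 3, 8])
```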
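And for the custom data loader idea, a rough sketch under my own (hypothetical) naming: group sample indices by sequence length and hand those groups to a DataLoader as a batch_sampler, so every batch contains equally long sequences and no padding is needed.

```python
import torch
from torch.utils.data import DataLoader

# toy dataset: 200 variable-length sequences (feature size 5) with dummy labels
dataset = [(torch.randn(torch.randint(3, 10, (1,)).item(), 5), 0) for _ in range(200)]

def make_length_batches(data, batch_size):
    """Group sample indices so that each batch only contains sequences of one length."""
    by_length = {}
    for idx, (seq, _) in enumerate(data):
        by_length.setdefault(seq.shape[0], []).append(idx)
    batches = []
    for indices in by_length.values():
        for i in range(0, len(indices), batch_size):
            batches.append(indices[i:i + batch_size])  # the last batch per length may not be full
    return batches

# batch_sampler yields one list of indices per batch, so the default collate
# can simply stack the equally long sequences
loader = DataLoader(dataset, batch_sampler=make_length_batches(dataset, 32))
for seqs, labels in loader:
    pass  # seqs has shape (batch, length, 5) with one uniform length per batch
```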

In general, it depends on what you want to do: sequence classification, sequence tagging, sequence-to-sequence (autoencoder, machine translation), …?

I always try (a) to go with batches simply for performance reasons and (b) to minimize the need for padding. For everything except machine translation I can currently avoid padding with my custom data loader.