How can I set the batch size in Natural Language Processing?

You see, almost every training run uses batches.
I can understand how batches and epochs work when processing images,
because the images all have the same size (pixels and channels).
But in natural language, sentences have different lengths, and so do documents.
Because of that, the output size differs from sentence to sentence in word2vec
or the RNN family of models.

So how can I set the batch size?
A batch means a pile of images in a CNN or something like that,
but does it mean some fixed number of words here…?

Please help this noob. Thank you.

You could pack input tensors of different lengths using e.g. torch.nn.utils.rnn.pack_sequence and later pad them back to the longest sequence via torch.nn.utils.rnn.pad_packed_sequence.
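
Here is a minimal sketch of that workflow with a toy LSTM; the sequence lengths, embedding size, and hidden size below are just illustrative:

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_sequence, pad_packed_sequence

# Three "sentences" of different lengths; each token is a 5-dim embedding.
seqs = [torch.randn(4, 5), torch.randn(2, 5), torch.randn(3, 5)]

# pack_sequence expects sequences sorted by decreasing length
# (or pass enforce_sorted=False to let PyTorch handle the ordering).
seqs = sorted(seqs, key=lambda s: s.size(0), reverse=True)
packed = pack_sequence(seqs)  # PackedSequence: no computation wasted on padding

rnn = nn.LSTM(input_size=5, hidden_size=8)
packed_out, (h_n, c_n) = rnn(packed)

# Pad back to a regular tensor of shape (max_len, batch, hidden) for later layers.
out, lengths = pad_packed_sequence(packed_out)
print(out.shape)   # torch.Size([4, 3, 8])
print(lengths)     # tensor([4, 3, 2])
```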

@vdw has also posted an approach that avoids padding entirely by sorting the input sequences by length.
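
As a rough illustration (not @vdw's actual code), one way to form batches that need no padding is to bucket sequences of equal length together:

```python
import random
from collections import defaultdict
import torch

def length_bucketed_batches(sequences, batch_size):
    """Yield batches in which all sequences share the same length."""
    buckets = defaultdict(list)
    for seq in sequences:
        buckets[seq.size(0)].append(seq)
    for same_length in buckets.values():
        random.shuffle(same_length)
        for i in range(0, len(same_length), batch_size):
            chunk = same_length[i:i + batch_size]
            # Stacking works because every sequence in the chunk has the same length.
            yield torch.stack(chunk)

# Toy data: 100 "sentences" with lengths between 3 and 7, embedding dim 5.
data = [torch.randn(random.randint(3, 7), 5) for _ in range(100)]
for batch in length_bucketed_batches(data, batch_size=16):
    pass  # batch has shape (<=16, seq_len, 5), ready to feed to an RNN
```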