How to pad a batch of documents?

Hello PyTorch experts:

Both sentences and documents can have variable length.

Let's say we have the following two docs:

import torch
from torch.nn.utils.rnn import pad_sequence

doc1 = [torch.tensor([1, 2, 3, 4]), torch.tensor([4, 5, 6]), torch.tensor([7, 5])]
doc2 = [torch.tensor([1, 2, 3]), torch.tensor([4, 5])]

(Here, each tensor is a sentence, and each number in a tensor is an index into the embedding matrix.)

doc1 = pad_sequence(doc1, batch_first=True)  # shape: (3, 4)
doc2 = pad_sequence(doc2, batch_first=True)  # shape: (2, 3)

batch = pad_sequence([doc1, doc2], batch_first=True)  # error!

This throws an error because the longest sentence in the first document is longer than the longest sentence in the second document: after padding, doc1 has shape (3, 4) and doc2 has shape (2, 3), and pad_sequence requires all trailing dimensions to match.

So we need to pad each sentence to the length of the longest sentence in the whole batch.

One solution would be to find the length of the longest sentence in the batch and pad every sentence to that fixed length, but PyTorch's pad_sequence function does not support padding to a given length.

Am I missing something? Is there any other PyTorch way to do this?

pad_sequence takes a list of tensors as input. However, you are giving it a list of lists of tensors.

pad_sequence can only pad the sequences within a single list of tensors (e.g., within doc1 or doc2), not across multiple lists. You probably need to do this manually, as you described: find the longest sentence among all documents and then pad all sentences accordingly.
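Something along these lines should work; this is just a minimal sketch, and the helper name pad_docs and its padding_value argument are made up for illustration:

import torch
import torch.nn.functional as F
from torch.nn.utils.rnn import pad_sequence

def pad_docs(docs, padding_value=0):
    # Longest sentence across all documents in the batch.
    max_len = max(len(sent) for doc in docs for sent in doc)
    # Right-pad every sentence to max_len, then stack each doc's sentences.
    padded = [
        torch.stack([F.pad(sent, (0, max_len - len(sent)), value=padding_value)
                     for sent in doc])
        for doc in docs
    ]
    # All docs now share the same sentence length, so pad_sequence can pad
    # the documents themselves to the same number of sentences.
    return pad_sequence(padded, batch_first=True, padding_value=padding_value)

batch = pad_docs([doc1, doc2])  # shape: (2, 3, 4)

(Here doc1 and doc2 are the original lists of tensors, before any pad_sequence call.)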

Wouldn't it be nice to have a padding function that handles this, i.e., pads a list of lists of tensors? Or at least the ability to pad to a given maximum length? Transforming a document into a tensor is a common task in NLP.

I'm not sure this is a practical enough use case. In general, you aim to minimize padding, which essentially boils down to padding only to the length of the longest sequence in a batch. And there are existing solutions like the BucketIterator, which creates batches whose sequences are of almost equal length. At least for RNNs, that's the de facto way to do it. For CNNs and Transformers, things might be a bit different.
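The core idea behind bucketing can be sketched in a few lines; this only illustrates the principle, not torchtext's BucketIterator implementation, and make_buckets is a made-up name:

def make_buckets(sentences, batch_size):
    # Sort by length so each batch groups sentences of similar length;
    # padding within a batch is then minimal.
    by_len = sorted(sentences, key=len)
    return [by_len[i:i + batch_size] for i in range(0, len(by_len), batch_size)]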

If you pad all sequences with respect to the longest one across your whole dataset/corpus – that is, beyond the scope of a batch – you create many batches where all sequences carry padding. That's particularly bad if you have only a few, or just one, very long sentence. Sure, with pack_padded_sequence you get around this, but you still lose training speed when your sequences are unnecessarily long (again, at least for RNNs).
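For completeness, here is a minimal sketch of the pack_padded_sequence route mentioned above; the shapes and the GRU are made up for illustration:

import torch
from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence, pad_packed_sequence

sents = [torch.randn(4, 8), torch.randn(2, 8)]  # two sentences, embedding dim 8
lengths = torch.tensor([4, 2])                  # true lengths before padding
padded = pad_sequence(sents, batch_first=True)  # shape: (2, 4, 8)
packed = pack_padded_sequence(padded, lengths, batch_first=True, enforce_sorted=False)

rnn = torch.nn.GRU(input_size=8, hidden_size=16, batch_first=True)
out, h = rnn(packed)                                           # the RNN skips padded steps
out, out_lengths = pad_packed_sequence(out, batch_first=True)  # back to padded form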