My take on how to solve this issue:
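import torch

# `device` was not defined in the original snippet; this is one common way to set it
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

def collate_fn_padd(batch):
    '''
    Pads a batch of variable-length sequences.
    Note: it converts things to tensors manually here, since the ToTensor
    transform assumes it takes in images rather than arbitrary tensors.
    '''
    # convert each item to a float tensor (matches the float default of torch.Tensor)
    batch = [torch.as_tensor(t, dtype=torch.float32) for t in batch]
    # get sequence lengths
    lengths = torch.tensor([t.shape[0] for t in batch]).to(device)
    # pad to the longest sequence; result has shape (max_len, batch_size, ...)
    batch = torch.nn.utils.rnn.pad_sequence(batch).to(device)
    # compute mask (caveat: genuine zeros in the data are also marked as padding)
    mask = (batch != 0)
    return batch, lengths, mask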
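One caveat with the mask above: `batch != 0` also flags real zeros in the data as padding. Below is a minimal sketch of an alternative that builds the mask from the sequence lengths instead and plugs the function into a `DataLoader`. The name `collate_with_length_mask` and the toy data are made up for illustration, and I am assuming each item is a tensor of shape `(seq_len, features)`. It also leaves everything on the CPU, on the assumption that you move the batch to your device in the training loop, which tends to be safer when `num_workers > 0`.

```python
import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader


def collate_with_length_mask(batch):
    """Pad variable-length sequences and build the mask from their lengths."""
    seqs = [torch.as_tensor(t, dtype=torch.float32) for t in batch]
    lengths = torch.tensor([s.shape[0] for s in seqs])
    padded = pad_sequence(seqs)  # shape: (max_len, batch_size, ...)
    # mask[t, b] is True where step t is a real (non-padded) step of sequence b
    steps = torch.arange(padded.shape[0]).unsqueeze(1)  # (max_len, 1)
    mask = steps < lengths.unsqueeze(0)                  # (max_len, batch_size)
    return padded, lengths, mask


# Toy data (hypothetical): sequences of length 3, 1 and 2 with 2 features each.
# The first one is all zeros, which a `!= 0` mask would wrongly treat as padding.
toy_data = [torch.zeros(3, 2), torch.ones(1, 2), torch.ones(2, 2)]
loader = DataLoader(toy_data, batch_size=3, collate_fn=collate_with_length_mask)
padded, lengths, mask = next(iter(loader))
```

Note that this mask has one entry per time step rather than per element; if you need it to line up with the feature dimension, `mask.unsqueeze(-1)` should broadcast against `padded`.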
There seem to be a lot of posts about this scattered all over the PyTorch forums, which makes it hard to find a solution. I have collected the ones I found below, hopefully making things easier for all of us:
- How to create batches of a list of varying dimension tensors?
- How to create a dataloader with variable-size input
- Using variable sized input - Is padding required?
- DataLoader for various length of data
- How to do padding based on lengths?
bucketing:
Stack Overflow also has a version of this question:
crossposted: https://www.quora.com/unanswered/How-does-Pytorch-Dataloader-handle-variable-size-data