How to do padding based on lengths?

tom · September 4, 2018, 7:57am

I think you are looking for torch.nn.utils.rnn.pad_sequence.

If you want to do this manually:

One greatly underappreciated (to my mind) feature of PyTorch is that you can allocate a tensor of zeros (of the right type) and then copy to slices without breaking the autograd link. This is what pad_sequence does (the source code is linked from the “headline” in the docs). The crucial bit is:

    out_tensor = sequences[0].data.new(*out_dims).fill_(padding_value)
    for i, tensor in enumerate(sequences):
        length = tensor.size(0)
        # use index notation to prevent duplicate references to the tensor
        if batch_first:
            out_tensor[i, :length, ...] = tensor
        else:
            out_tensor[:length, i, ...] = tensor

If the tensors require grad, so will out_tensor and the gradients will flow back to the tensors in the list.

Another way to do this, that seems closer to your description, is to use a cat (or pad) in a list comprehension and push that to another cat.

# setup
import torch
l = [torch.tensor([1,2,3]), torch.tensor([4,5]),torch.tensor([6,7,8,9])]
emb_len=4
# this is what you want:
lp = torch.stack([torch.cat([i, i.new_zeros(emb_len - i.size(0))], 0) for i in l],1)

Best regards

Thomas