How to create batches of a list of varying dimension tensors?

I think you could implement something similar to what PackedSequence does, but for your case. In other words, you can manually pad your “sequences” (they are not really sequences, as you mentioned) and pass their “lengths” along with them through the DataLoader's collate_fn, so you know how to get the essential values back later (e.g., the padded tensor has size (5, 100), but you only need [0:2, :] from it). In practice, you could do something like this:

import torch

def collate_fn(data):
    """
       data: is a list of tuples with (example, label, length)
             where 'example' is a tensor of shape (length, n_ftrs),
             'label' is a scalar and 'length' is example.size(0)
    """
    _, labels, lengths = zip(*data)
    max_len = max(lengths)
    n_ftrs = data[0][0].size(1)
    features = torch.zeros((len(data), max_len, n_ftrs))
    labels = torch.tensor(labels)
    lengths = torch.tensor(lengths)

    for i in range(len(data)):
        j, k = data[i][0].size(0), data[i][0].size(1)
        # pad each example with zeros along the first dimension up to max_len
        features[i] = torch.cat([data[i][0], torch.zeros((max_len - j, k))])

    return features.float(), labels.long(), lengths.long()

The function above is passed as the collate_fn parameter of the DataLoader, as in this example:

DataLoader(toy_dataset, collate_fn=collate_fn, batch_size=5)
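
Here, toy_dataset is whatever Dataset you already have; the only requirement is that its __getitem__ returns the (example, label, length) tuples collate_fn expects. A minimal sketch of such a dataset (the ToyDataset class, shapes, and random data below are just placeholders I made up for illustration, not part of the original question) could look like this:

from torch.utils.data import Dataset, DataLoader

class ToyDataset(Dataset):
    def __init__(self, examples, labels):
        self.examples = examples   # list of (length_i, n_ftrs) tensors, lengths may differ
        self.labels = labels       # list of scalar labels

    def __len__(self):
        return len(self.examples)

    def __getitem__(self, idx):
        example = self.examples[idx]
        # also return the true length so collate_fn knows how much is real data
        return example, self.labels[idx], example.size(0)

# e.g. five examples with lengths 2..6 and 100 features each
examples = [torch.randn(n, 100) for n in range(2, 7)]
toy_dataset = ToyDataset(examples, labels=list(range(5)))

loader = DataLoader(toy_dataset, collate_fn=collate_fn, batch_size=5)
features, labels, lengths = next(iter(loader))
print(features.shape)   # torch.Size([5, 6, 100]) – padded to the longest example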

With this collate_fn function, you will always get a batch tensor in which all examples have the same size. So, when you feed this data to your forward() function, you need to use the lengths to get the original data back, so the meaningless padding zeros do not affect your computation.
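
How exactly you use the lengths depends on your model. One common option (just a sketch, not from the original answer) is to build a boolean mask from lengths and ignore the padded positions, e.g. when averaging over the first dimension:

def masked_mean(features, lengths):
    """Average over the padded dimension, ignoring the zero padding.

    features: (batch, max_len, n_ftrs) tensor produced by collate_fn
    lengths:  (batch,) tensor with each example's true length
    """
    max_len = features.size(1)
    # mask[i, t] is True for the real (non-padded) positions of example i
    mask = torch.arange(max_len, device=features.device)[None, :] < lengths[:, None]
    summed = (features * mask.unsqueeze(-1)).sum(dim=1)
    return summed / lengths.unsqueeze(-1).float()

If you are feeding the batch into an RNN instead, torch.nn.utils.rnn.pack_padded_sequence(features, lengths, batch_first=True, enforce_sorted=False) serves the same purpose of keeping the padding out of the computation.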
