How to pad variable length input tensor representing text

Hi,

This is my model:

ImdbReviewModel(
(embed): Embedding(95423, 30)
(gru): GRU(30, 128)
(fc1): Linear(in_features=128, out_features=1, bias=True)
)

I want to do sentiment analysis on the IMDB dataset, but I’m having trouble batching the data.
Obviously, the reviews all have different lengths, which wouldn’t be a problem for the embedding layer, but the DataLoader is not able to batch them.

Now, I think I’m supposed to pad my tensors, but I really don’t understand how. I’ve looked at a lot of threads on this forum and at the docs, but I have no idea what to do.
I guess I have to write my own collate_fn and pad inside it, but I’m not sure which function I should use for padding and what that function has to return.
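
For reference, this is what I think torch.nn.utils.rnn.pad_sequence does with a list of variable-length token-id tensors (toy example, not my real data):

import torch
from torch.nn.utils.rnn import pad_sequence

# three "reviews" of different lengths, as 1-D tensors of token ids
a = torch.tensor([4, 8, 15])
b = torch.tensor([16, 23])
c = torch.tensor([42])

# zero-pads every sequence to the longest one in the list;
# batch_first=True gives shape (batch, max_len) = (3, 3)
padded = pad_sequence([a, b, c], batch_first=True)
# tensor([[ 4,  8, 15],
#         [16, 23,  0],
#         [42,  0,  0]])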

Also, it seems that packed padding is meant for RNNs, but my first layer is an embedding layer, so maybe that’s not even the right thing?
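
From reading the docs, my understanding is that packing happens after the embedding and before the GRU, so the embedding layer shouldn’t get in the way. Roughly what I have in mind, as a sketch only (the lengths tensor is assumed to hold each review’s real length, and I set batch_first=True everywhere):

import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence

class ImdbReviewModel(nn.Module):
    def __init__(self, vocab_size=95423, embed_dim=30, hidden_size=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.gru = nn.GRU(embed_dim, hidden_size, batch_first=True)
        self.fc1 = nn.Linear(hidden_size, 1)

    def forward(self, padded, lengths):
        # padded: (batch, max_len) token ids, lengths: real length of each review
        embedded = self.embed(padded)                        # (batch, max_len, 30)
        packed = pack_padded_sequence(embedded, lengths,
                                      batch_first=True,
                                      enforce_sorted=False)  # GRU skips the padded steps
        _, hidden = self.gru(packed)                         # hidden: (1, batch, 128)
        return self.fc1(hidden[-1])                          # (batch, 1) logit per review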

Help would be very much appreciated.

This is my current code:

def collate_fn_padd(batch):
    all_inputs = []
    all_labels = []

    for sample in batch:
        inp, label = sample
        all_inputs.append(inp)
        all_labels.append(label)

    batch = torch.nn.utils.rnn.pad_sequence(all_inputs)
    batch = list(zip(batch, all_labels))

    return batch

train_loader = DataLoader(train_dataset.dataset, 10, shuffle=True, collate_fn=collate_fn_padd)

Two problems with that code: 1. the tensors don’t actually seem padded (no zeros); 2. it gives a list, not a “proper” batch that I can feed to the network, right?
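
For what it’s worth, this is roughly what I now think the collate_fn should look like instead, returning real tensors (assuming each dataset item is a 1-D tensor of token ids plus a plain-number label):

import torch
from torch.nn.utils.rnn import pad_sequence

def collate_fn_padd(batch):
    # batch: list of (token_id_tensor, label) pairs from the dataset
    inputs = [sample[0] for sample in batch]
    labels = [sample[1] for sample in batch]

    padded = pad_sequence(inputs, batch_first=True)        # (batch, max_len), zero-padded
    lengths = torch.tensor([len(seq) for seq in inputs])   # real lengths, useful for packing
    labels = torch.tensor(labels, dtype=torch.float)       # (batch,)

    return padded, lengths, labels

The DataLoader call itself would stay the same, just with this version of collate_fn_padd.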

Okay, I solved the issue. I padded my tensors with this function (inside the dataset):

def pad_tensor(t):
    t = torch.tensor(t)
    # pad with zeros on the right, up to the longest review length in the dataset
    padding = max(max_review_len) - t.size(0)
    t = torch.nn.functional.pad(t, (0, padding))
    return t
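
With every review padded to the same global length inside the dataset, the default DataLoader batching works without a custom collate_fn. Roughly how I call it from __getitem__ (self.reviews / self.labels are just placeholders for my own fields):

def __getitem__(self, idx):
    # self.reviews[idx]: list of token ids, self.labels[idx]: 0/1 sentiment label
    return pad_tensor(self.reviews[idx]), self.labels[idx]

The downside compared to padding in a collate_fn is that every batch is padded to the longest review in the whole dataset rather than the longest in the batch, but it keeps the DataLoader setup simple.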