Questions about one_batch() and padding

Hello all,

SortSampler (used for validation) and SortishSampler (used for training) are both described in the docs as follows:

collate the samples in batches while adding padding with pad_idx. If pad_first=True, padding is applied at the beginning (before the sentence starts); otherwise it is applied at the end.

Does this mean the padded sequence length will be the length of the longest sequence in each batch? Or the longest in each partition (e.g. train, valid), or in the whole dataset?
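To make sure I understand the description, here is a rough sketch of what I think such a collate function does (the name pad_collate_sketch and the pad_idx=1 default are my own assumptions, not the actual fastai source):

import torch

def pad_collate_sketch(samples, pad_idx=1, pad_first=True):
    # Pad every sequence up to the longest sequence in *this batch*
    # (this is my reading of the docs, not fastai's implementation).
    max_len = max(len(seq) for seq, _ in samples)
    res = torch.full((len(samples), max_len), pad_idx, dtype=torch.long)
    for i, (seq, _) in enumerate(samples):
        seq = torch.as_tensor(seq, dtype=torch.long)
        if pad_first:
            res[i, max_len - len(seq):] = seq  # padding before the sentence
        else:
            res[i, :len(seq)] = seq            # padding after the sentence
    ys = torch.tensor([y for _, y in samples])
    return res, ys

If this reading is right, each batch can have a different width, equal to the longest sequence it happens to contain.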

So I set pad_first=True and wanted to check this by looking at several batches.

My model is an RNNLearner, and x = model.data.one_batch() returns a batch from the training DataLoader. Here is my output:

tensor([[1, 2, 1, …, 1, 1, 1],
[0, 0, 0, …, 4, 2, 1],
[0, 0, 0, …, 1, 1, 3],
…,
[0, 0, 0, …, 2, 4, 3],
[0, 0, 0, …, 3, 1, 4],
[0, 0, 0, …, 3, 3, 1]])

And the shape of this tensor is:

x[0].shape
torch.Size([32, 11556])

Here 32 is the batch size and 11556 is the length of the longest sequence in my whole dataset. When I look at a batch from the validation set with y = model.data.one_batch(ds_type=2), I get:

(tensor([[1, 2, 1, …, 1, 1, 1],
[4, 1, 2, …, 3, 2, 1],
[0, 0, 0, …, 3, 2, 4],
…,
[0, 0, 0, …, 2, 1, 2],
[0, 0, 0, …, 3, 1, 3],
[0, 0, 0, …, 3, 3, 2]]),
tensor([1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1,
0, 1, 0, 0, 1, 1, 0, 0]))

y[0].shape
torch.Size([32, 11556])

However, the longest sequence in the validation set is only around 6000 characters, so this tensor size does not make sense to me.
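For reference, this is how I measured those maximum lengths (assuming the numericalized ids live in the .data attribute of the items in train_ds.x and valid_ds.x, which is my guess at the attribute layout):

# Longest numericalized sequence in each partition (my own check).
train_max = max(len(item.data) for item in model.data.train_ds.x)
valid_max = max(len(item.data) for item in model.data.valid_ds.x)
print(train_max, valid_max)  # roughly 11556 and 6000 for my data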

Can you please help me with these questions? I also want to look at other batches, but I am not sure what the correct way to do that is; my best guess is sketched below.
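This is what I am currently trying in order to pull a few batches directly from the validation DataLoader (assuming DataBunch.dl and DatasetType are the right tools here; please correct me if not):

from fastai.basic_data import DatasetType

# Grab the first few batches straight off the validation DataLoader
# and print their shapes, to see whether the padded width changes
# from batch to batch or stays fixed.
valid_dl = model.data.dl(DatasetType.Valid)
for i, (xb, yb) in enumerate(valid_dl):
    print(i, xb.shape)
    if i == 2:
        break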

Thank you!