How to embed a batch of sentences with varying numbers of words

Hi,

If I want to embed a batch of 2 samples of 4 indices each, I know I can do it as follows:

import torch
import torch.nn as nn
embedding = nn.Embedding(10, 3)
input = torch.LongTensor([[1,2,4,5],[4,3,2,9]])
embedding(input)

However, I don’t know how I can embed a batch of 2 samples with different lengths, like below:

import torch
import torch.nn as nn
embedding = nn.Embedding(10, 3)
input2 = torch.LongTensor([[1,2,4,5],[4,3,2]])
# Now I get an error here saying:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: expected sequence of length 4 at dim 1 (got 3)

I saw that I can provide a padding id and use it to balance the sequence lengths:

embedding = nn.Embedding(10, 3, padding_idx=0) 
input = torch.LongTensor([[1,2,4,5],[4,3,2,0]])
embedding(input)
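
For reference, here is how I would build the padded batch automatically instead of typing the zeros by hand. This is only a sketch of what I have in mind; I am assuming torch.nn.utils.rnn.pad_sequence is the right tool here and that right-padding with 0 matches padding_idx=0:

import torch
import torch.nn as nn
from torch.nn.utils.rnn import pad_sequence

embedding = nn.Embedding(10, 3, padding_idx=0)

# keep the variable-length samples as separate 1-D tensors
seqs = [torch.LongTensor([1, 2, 4, 5]), torch.LongTensor([4, 3, 2])]

# pad_sequence right-pads the shorter sample with padding_value (0 here),
# so the result has shape (batch, max_len) = (2, 4)
padded = pad_sequence(seqs, batch_first=True, padding_value=0)
out = embedding(padded)  # shape (2, 4, 3)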

I have 3 questions regarding this usage:

  1. Does this padding solve my problem? I.e., can I use it as shown above to balance the sequence lengths?
  2. Does it make a difference whether the padding token goes at the beginning or at the end of a sentence?
  3. Does the padding token affect the calculations during back-propagation? (I sketched a small check right after this list to illustrate what I mean.)
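
To make questions 1 and 3 concrete, here is the kind of check I have in mind. It is a minimal sketch; I am assuming that with padding_idx=0 the embedding row for index 0 and its gradient should both stay zero, but that assumption is exactly what I am asking about:

import torch
import torch.nn as nn

embedding = nn.Embedding(10, 3, padding_idx=0)
input = torch.LongTensor([[1, 2, 4, 5], [4, 3, 2, 0]])

out = embedding(input)
out.sum().backward()

print(embedding.weight[0])        # row for the padding index: all zeros at init
print(embedding.weight.grad[0])   # gradient for the padding index: stays zero
print(embedding.weight.grad[2])   # index 2 appears twice, so its gradient is non-zero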