Padding zeros in nn.Embedding while using pre-trained word vectors

Hi, I have come across a problem in PyTorch with embeddings in NLP.
Suppose I have |N| sentences of different lengths, and I set max_len to the maximum length among them, so the shorter sentences need to be padded with zero vectors. I define the embedding as below, with one extra zero vector at index vocab_size:

emb = nn.Embedding(vocab_size +1, emb_num, padding_idx=vocab_size)
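For context, here is a minimal, self-contained sketch of what I mean by padding to max_len (the concrete numbers are just illustrative):

import torch
import torch.nn as nn

vocab_size, emb_num, max_len = 10, 3, 5
emb = nn.Embedding(vocab_size + 1, emb_num, padding_idx=vocab_size)

# two sentences of lengths 5 and 3, both padded to max_len with index vocab_size
batch = torch.tensor([[8, 1, 2, 5, 7],
                      [9, 6, 4, vocab_size, vocab_size]])
out = emb(batch)    # shape (2, max_len, emb_num); padded positions come out as zero vectors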

But when I use my pre-trained word vectors, I need to construct the weight matrix, including the zero vector, manually:

myw2v = torch.from_numpy(myw2v_array)  # myw2v_array: the pre-trained vectors loaded from 'myw2v.model' as a numpy array of shape (vocab_size, emb_num)
zeros = torch.zeros(1, emb_num)
myw2v = torch.cat((myw2v, zeros), 0)   # append the zero padding vector as the last row
emb.weight.data.copy_(myw2v)

So I’d like to know whether there is a function in PyTorch that can handle this situation directly?
Thanks.


Did you figure this out? I’d also like to know the same.

I just do this manually with torch.cat and could not find a built-in function in PyTorch for it.


As far as I understand, (vocab_size + 1) is the number of unique embedding entries. Your input size does not have to be vocab_size. When you use padding_idx=vocab_size, you are saying the input will always have the size of your vocabulary, and this is not the case.
There are two possible solutions:

  1. You can give padding_idx the size of your input after padding (max_len + 1).
  2. You can add a zero at the beginning of each sentence and use padding_idx=0. After embedding, remove the first embedded element and pass the result to the next layer.

I hope it is clear.

Here is an example (I create my input by prepending a zero, my padding index, to each sentence), but you can use torch.cat like this:

import torch
import torch.nn as nn

input = torch.tensor([[8, 1, 2, 5], [9, 6, 0, 0]])
zeros = torch.zeros(2, 1).long()
input = torch.cat((zeros, input), 1)   # prepend the padding index 0 to each sentence
print(input)
emb = nn.Embedding(10, 3, padding_idx=0)
embedded = emb(input)[:, 1:]           # drop the prepended padding position after embedding
print(embedded)

output:

tensor([[0, 8, 1, 2, 5],
        [0, 9, 6, 0, 0]])
tensor([[[ 1.5745, -2.3024, -0.5964],
         [-1.5407,  0.8915, -1.3858],
         [-1.2728,  1.5402,  0.7473],
         [ 0.1958,  0.4828,  0.5091]],

        [[-0.1123, -0.3911,  0.4758],
         [ 0.0876, -0.6896,  0.0802],
         [ 0.0000,  0.0000,  0.0000],
         [ 0.0000,  0.0000,  0.0000]]], grad_fn=<SliceBackward>)

@smth could you please help us with this issue? Thanks!

You can specify a certain index to represent the zero padding with torch.nn.Embedding(padding_idx=...). Then, for each sentence shorter than max_len, create a list of length (max_len - sentence length) whose elements are all the padding_idx index, and append it to the existing list of word indices representing the sentence.
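For example, a rough sketch of this (padding_idx=0 here, and the sentence indices are made up):

import torch
import torch.nn as nn

max_len = 6
emb = nn.Embedding(10, 3, padding_idx=0)        # index 0 is reserved for padding

sentence = [8, 1, 2, 5]                         # word indices for one sentence
padded = sentence + [0] * (max_len - len(sentence))
embedded = emb(torch.tensor([padded]))          # the padded positions map to zero vectors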

In my case the padding index in my input was -1, so I remap it to the embedding's padding_idx.
I did:

emb = nn.Embedding(vocab_size + 1, emb_num, padding_idx=vocab_size)

My input sequence is named 'tgt':

tgt_pad = tgt.clone()
tgt_pad[tgt == -1] = vocab_size   # replace -1 with the embedding's padding index
emb_vec = emb(tgt_pad)
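A small self-contained version of the same idea (the concrete values and vocab_size = 10 are made up for illustration):

import torch
import torch.nn as nn

vocab_size, emb_num = 10, 4
emb = nn.Embedding(vocab_size + 1, emb_num, padding_idx=vocab_size)

tgt = torch.tensor([[3, 7, -1, -1]])   # -1 marks the padded positions
tgt_pad = tgt.clone()
tgt_pad[tgt == -1] = vocab_size        # remap -1 to the padding index
emb_vec = emb(tgt_pad)                 # padded positions come out as zero vectors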