Hello,
I think there is still an ambiguity in the documentation, which states that “the entries at padding_idx
do not contribute to the gradient; therefore, the embedding vector at padding_idx
is not updated during training”.
Here is a toy example in which I sum the entries of the embedding matrix and backpropagate from there:
import torch
import torch.nn as nn
import torch.optim as optim

a = nn.Embedding(4, 4, padding_idx=0)
S = a.weight.sum()
S.backward()
a.weight.grad
tensor([[1., 1., 1., 1.],
        [1., 1., 1., 1.],
        [1., 1., 1., 1.],
        [1., 1., 1., 1.]])
optimizer = optim.SGD(a.parameters(), lr=0.1)
optimizer.step()
a.weight
Parameter containing:
tensor([[-0.1000, -0.1000, -0.1000, -0.1000],
        [-1.2080, -0.6763, -0.1524, -2.1535],
        [-0.0298, -1.2028,  1.4500, -0.8885],
        [-0.8094, -0.3046, -2.0699, -0.7560]], requires_grad=True)
As you can see, the row at padding_idx (index 0) does receive a gradient, and the optimizer step does update it.
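For contrast, here is my own reading of the documentation, which may be what it actually means: if the loss depends on the *output* of the embedding lookup (rather than on a.weight directly, as in my example above), then the padding row receives no gradient even when index 0 appears in the input:

```python
import torch
import torch.nn as nn

emb = nn.Embedding(4, 4, padding_idx=0)

# Gradients flow through the forward lookup, not through emb.weight directly.
# Index 0 (the padding index) appears in the input batch.
out = emb(torch.tensor([0, 1, 2]))
out.sum().backward()

print(emb.weight.grad[0])  # padding row: gradient is zero
print(emb.weight.grad[1])  # looked-up row: gradient is one
```

In this case the gradient at row 0 is all zeros, so an optimizer step would leave the padding vector unchanged, which matches the documented behavior.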
Could you explain how to understand the documentation? (Or what I’ve done wrong?)