What happens when we don't set padding_idx?

What is the purpose of padding_idx in nn.Embedding? It seems that the vector at padding_idx is initialized as zeros and is not trained. But when is it actually used for padding?

What happens when we do not set a padding_idx?
What happens when we do not set a padding_idx but still have a padding token at the beginning of the vocab?

If padding_idx is set, the output tensor will contain all zeros at the positions where the input tensor had padding_idx.
You can use it to ignore specific inputs without having to remove those indices from your input tensor.
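A minimal sketch of that behavior (the sizes and the choice of 0 as the padding index here are just assumptions for illustration):

import torch
import torch.nn as nn

emb = nn.Embedding(num_embeddings=5, embedding_dim=3, padding_idx=0)
idx = torch.tensor([[0, 2, 4, 0]])  # 0 is the padding index
out = emb(idx)
print(out[0, 0])  # all zeros, because index 0 is padding_idx
print(out[0, 1])  # the (trainable) embedding of token 2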

Hello,

I think there is still an ambiguity in the documentation, which states that “the entries at padding_idx do not contribute to the gradient; therefore, the embedding vector at padding_idx is not updated during training”.

Here is a toy example in which I sum the entries of the embedding matrix and backpropagate from there:

import torch.nn as nn
import torch.optim as optim

a = nn.Embedding(4, 4, padding_idx=0)

S = a.weight.sum()

S.backward()

a.weight.grad
tensor([[1., 1., 1., 1.],
        [1., 1., 1., 1.],
        [1., 1., 1., 1.],
        [1., 1., 1., 1.]])

optimizer = optim.SGD(a.parameters(), lr=0.1)

optimizer.step()

a.weight
Parameter containing:
tensor([[-0.1000, -0.1000, -0.1000, -0.1000],
        [-1.2080, -0.6763, -0.1524, -2.1535],
        [-0.0298, -1.2028,  1.4500, -0.8885],
        [-0.8094, -0.3046, -2.0699, -0.7560]], requires_grad=True)

I see that the entry at the PAD index does develop a gradient and does update.

Could you explain how to understand the documentation? (Or what I’ve done wrong?)

You should not use the embedding weight directly. In your code you treat the embedding weights as an ordinary tensor instead of going through the nn.Embedding() layer, i.e. its forward() lookup. The gradient at padding_idx is only masked out in the backward pass of that lookup, so summing a.weight bypasses the mechanism and the padding row receives a gradient like any other parameter.
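For comparison, a small sketch of the same setup where the loss is computed through the embedding lookup (the indices used here are arbitrary); in that case the row at padding_idx keeps a zero gradient:

import torch
import torch.nn as nn

a = nn.Embedding(4, 4, padding_idx=0)
idx = torch.tensor([0, 1, 2, 3])  # includes the padding index 0
loss = a(idx).sum()               # loss goes through the forward lookup
loss.backward()
print(a.weight.grad[0])           # zeros: the padding row gets no gradient
print(a.weight.grad[1])           # ones: ordinary rows do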