What happens when we don't set padding_idx?

What is padding_idx used for in nn.Embedding? It seems that the vector at padding_idx is initialized to zeros and is not trained. But when is it actually used for padding?

What happens when we do not set a padding_idx?
What happens when we do not set a padding_idx but still have a padding token at the beginning of the vocab?

If padding_idx is set, the embedding vector at that index is initialized to zeros, so the output tensor will contain all zeros at every position where the input tensor had the padding_idx.
You could use it to ignore specific inputs without having to remove these indices from your input tensor.
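
To illustrate the point above, here is a minimal sketch. The vocabulary size, embedding dimension and token ids are made up for the example, and it assumes the default initialization (if you overwrite the weights yourself, the row at padding_idx is whatever you put there):

```python
import torch
import torch.nn as nn

# Hypothetical sizes: 5 tokens in the vocab, 3-dimensional embeddings.
emb = nn.Embedding(num_embeddings=5, embedding_dim=3, padding_idx=0)

# A batch of two sequences, right-padded with index 0.
tokens = torch.tensor([[1, 2, 0, 0],
                       [3, 4, 2, 0]])

out = emb(tokens)  # shape: (2, 4, 3)

# Positions that hold padding_idx map to the zero row of the weight matrix.
pad_mask = tokens == 0
print(out[pad_mask])   # one row of zeros per padded position
print(emb.weight[0])   # the row at padding_idx itself
```

The padded positions are still present in the output, they just carry zero vectors, so any masking of later layers (attention, loss, pooling) is still up to you.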


I think there is still an ambiguity in the documentation, which states that “the entries at padding_idx do not contribute to the gradient; therefore, the embedding vector at padding_idx is not updated during training”.

Here is a toy example in which I sum the entries of the embedding matrix and backpropagate from there:

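import torch
import torch.nn as nn
import torch.optim as optim

a = nn.Embedding(4, 4, padding_idx=0)

# Sum the weight matrix directly (no embedding lookup) and backpropagate.
S = a.weight.sum()
S.backward()
print(a.weight.grad)

tensor([[1., 1., 1., 1.],
        [1., 1., 1., 1.],
        [1., 1., 1., 1.],
        [1., 1., 1., 1.]])

# Take one SGD step and inspect the updated weights.
optimizer = optim.SGD(a.parameters(), lr=0.1)
optimizer.step()
print(a.weight)

Parameter containing:
tensor([[-0.1000, -0.1000, -0.1000, -0.1000],
        [-1.2080, -0.6763, -0.1524, -2.1535],
        [-0.0298, -1.2028,  1.4500, -0.8885],
        [-0.8094, -0.3046, -2.0699, -0.7560]], requires_grad=True)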

I see that the entry at the PAD index does develop a gradient and does update.

Could you explain how to understand the documentation? (Or what I’ve done wrong?)
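
One way to read the documentation, sketched below under the assumption that it refers only to gradients flowing through the embedding lookup itself: when the loss is computed from emb(tokens), the row at padding_idx receives a zero gradient, whereas summing emb.weight directly bypasses the lookup and so every row gets a gradient. The names emb, tokens, grad_lookup and grad_direct are just for this example:

```python
import torch
import torch.nn as nn
import torch.optim as optim

emb = nn.Embedding(4, 4, padding_idx=0)
optimizer = optim.SGD(emb.parameters(), lr=0.1)

# Case 1: the gradient flows through the lookup.
tokens = torch.tensor([0, 1, 0, 2])
emb(tokens).sum().backward()
grad_lookup = emb.weight.grad.clone()
print(grad_lookup)      # row 0 (padding) is all zeros, rows 1 and 2 are ones
optimizer.step()
print(emb.weight[0])    # still all zeros after the update

# Case 2: the weight is used directly, bypassing the lookup.
optimizer.zero_grad()
emb.weight.sum().backward()
grad_direct = emb.weight.grad.clone()
print(grad_direct[0])   # all ones: the padding row now has a gradient
```

In the toy example from the question, the loss is built as in case 2, which would explain why the padding row is updated there.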