Hello,
I think there is still an ambiguity in the documentation, which states that “the entries at padding_idx
do not contribute to the gradient; therefore, the embedding vector at padding_idx
is not updated during training”.
Here is a toy example in which I sum the entries of the embedding matrix and backpropagate from there:
import torch
import torch.nn as nn
import torch.optim as optim

a = nn.Embedding(4, 4, padding_idx=0)
S = a.weight.sum()
S.backward()
a.weight.grad
tensor([[1., 1., 1., 1.],
        [1., 1., 1., 1.],
        [1., 1., 1., 1.],
        [1., 1., 1., 1.]])
optimizer = optim.SGD(a.parameters(), lr=0.1)
optimizer.step()
a.weight
Parameter containing:
tensor([[-0.1000, -0.1000, -0.1000, -0.1000],
        [-1.2080, -0.6763, -0.1524, -2.1535],
        [-0.0298, -1.2028,  1.4500, -0.8885],
        [-0.8094, -0.3046, -2.0699, -0.7560]], requires_grad=True)
As you can see, the row at padding_idx (index 0) does receive a gradient, and the optimizer step does update it.
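For contrast, here is my own reading of the documentation, which may be what it actually means: if the loss depends on the *output* of the embedding lookup (rather than on a.weight directly, as in my example above), then the padding row receives no gradient even when index 0 appears in the input:

```python
import torch
import torch.nn as nn

emb = nn.Embedding(4, 4, padding_idx=0)

# Gradients flow through the forward lookup, not through emb.weight directly.
# Index 0 (the padding index) appears in the input batch.
out = emb(torch.tensor([0, 1, 2]))
out.sum().backward()

print(emb.weight.grad[0])  # padding row: gradient is zero
print(emb.weight.grad[1])  # looked-up row: gradient is one
```

In this case the gradient at row 0 is all zeros, so an optimizer step would leave the padding vector unchanged, which matches the documented behavior.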
Could you explain how to understand the documentation? (Or what I’ve done wrong?)