Hi everyone, I think there might be a bug in the gradients of nn.Embedding when sparse=True and padding_idx is set. Below are two code snippets that reproduce it.
import torch
import torch.nn as nn
from torch.autograd import Variable
# below is the same code provided in the documentation
# http://pytorch.org/docs/master/nn.html#embedding
input = Variable(torch.LongTensor([[0,2,0,5]]))
embedding = nn.Embedding(10, 3, padding_idx=0)
model = nn.Sequential(embedding)
opt = torch.optim.SGD(model.parameters(), 0.01)
opt.zero_grad()
loss = torch.sum(model(input))
loss.backward()
opt.step()
print(embedding.weight.data)
The first script should print something like the output below. After opt.step(), the first row is still zero, which is consistent with this reply saying that the embedding gradient of the padding index is ignored.
 0.0000  0.0000  0.0000
-1.0657 -1.0059 -1.4740
 0.5380 -0.5131  0.1291
 0.0899 -1.4056  0.0625
 0.1345 -1.0449 -1.5367
 0.9558  2.8128 -2.5808
 0.9454  0.0503 -2.6308
-1.5984 -0.4989  0.0800
 1.7455 -1.4634 -1.4889
-1.0654  0.2526  1.0377
[torch.FloatTensor of size 10x3]
However, when you set sparse=True, the first row becomes nonzero after the update. Could this be a bug? (My PyTorch version is 0.2.0_3.)
import torch
import torch.nn as nn
from torch.autograd import Variable
# the row of `padding_idx` becomes non-zero after update when `sparse=True`
input = Variable(torch.LongTensor([[0,2,0,5]]))
embedding = nn.Embedding(10, 3, padding_idx=0, sparse=True)
model = nn.Sequential(embedding)
opt = torch.optim.SGD(model.parameters(), 0.01)
opt.zero_grad()
loss = torch.sum(model(input))
loss.backward()
opt.step()
print(embedding.weight.data)
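To compare the two cases without reading the full weight matrix, the two scripts can be folded into one small check. This is only a sketch: the helper name padding_row_after_step is made up for illustration and is not part of PyTorch. It runs the same single SGD step as above and returns only the row at padding_idx. What the sparse case prints may depend on the PyTorch version installed, so no expected output is shown for it. On PyTorch 0.4 and later the Variable wrapper is no longer needed, but it is kept here to match the scripts above.

```python
import torch
import torch.nn as nn
from torch.autograd import Variable


def padding_row_after_step(sparse, padding_idx=0):
    """Hypothetical helper: run one SGD step and return the padding_idx row."""
    embedding = nn.Embedding(10, 3, padding_idx=padding_idx, sparse=sparse)
    opt = torch.optim.SGD(embedding.parameters(), lr=0.01)
    # same indices as in the scripts above; index 0 is the padding index
    indices = Variable(torch.LongTensor([[0, 2, 0, 5]]))
    opt.zero_grad()
    loss = torch.sum(embedding(indices))
    loss.backward()
    opt.step()
    # copy the row so later updates to the weights cannot change the result
    return embedding.weight.data[padding_idx].clone()


for sparse in (False, True):
    row = padding_row_after_step(sparse)
    # the row is untouched only if every entry is still exactly zero
    is_zero = float(row.abs().sum()) == 0.0
    print("sparse=%s padding row=%s still zero=%s" % (sparse, row.tolist(), is_zero))
```

If the second line reports "still zero=False", the padding row was updated in the sparse case, which is the behaviour described in this post.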