When should I use a sparse embedding instead of a dense embedding?

When should I choose to set sparse=True for an Embedding layer? What are the pros and cons of the sparse and dense versions of the module?

Use sparse=True when most of the embeddings are not updated during training, that is, when each step updates the representations of only a few words and the rest stay as they were. For example,

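import torch
import torch.nn as nn

class A(nn.Module):
    def __init__(self):
        super().__init__()
        # 10 embeddings of dimension 10; sparse=True makes the weight gradient a sparse tensor
        self.embedding = nn.Embedding(10, 10, sparse=True)

    def forward(self, x):
        return 2 * self.embedding(x)

net = A()
# Look up only rows 8 and 7 of the embedding table
loss = net(torch.LongTensor([8, 7])).sum()
# No backward pass has run yet, so there is no gradient to print
for param in net.parameters():
    print(param.grad)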

prints

None

because no backward pass has run yet. After the backward pass,

loss.backward()
for param in net.parameters():
    print(param.grad)

gives

tensor(indices=tensor([[8, 7]]),
       values=tensor([[2., 2., 2., 2., 2., 2., 2., 2., 2., 2.],
                      [2., 2., 2., 2., 2., 2., 2., 2., 2., 2.]]),
       size=(10, 10), nnz=2, layout=torch.sparse_coo)

Not all of the word embeddings are updated during training. Only the rows that were looked up (7 and 8 here) receive a gradient, and the sparse gradient stores just those rows. If we do not use sparse=True, then

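# Same model, but built with nn.Embedding(10, 10) (sparse=False, the default); rebuild net and loss first
loss.backward()
for param in net.parameters():
    print(param.grad)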

would give something like,

tensor([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [2., 2., 2., 2., 2., 2., 2., 2., 2., 2.],
        [2., 2., 2., 2., 2., 2., 2., 2., 2., 2.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]])
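To check that the two printouts above carry the same information, here is a minimal sketch that builds one sparse and one dense embedding with identical weights and compares their gradients. The names sparse_emb and dense_emb are my own, chosen just for illustration.

```python
import torch
import torch.nn as nn

# Two embedding tables with identical weights; only the gradient layout differs.
sparse_emb = nn.Embedding(10, 10, sparse=True)
dense_emb = nn.Embedding(10, 10, sparse=False)
with torch.no_grad():
    dense_emb.weight.copy_(sparse_emb.weight)

idx = torch.LongTensor([8, 7])
(2 * sparse_emb(idx)).sum().backward()
(2 * dense_emb(idx)).sum().backward()

sparse_grad = sparse_emb.weight.grad
dense_grad = dense_emb.weight.grad

# The sparse gradient uses the torch.sparse_coo layout, the dense one does not.
print(sparse_grad.is_sparse, dense_grad.is_sparse)
# Densifying the sparse gradient recovers the dense gradient exactly.
print(torch.equal(sparse_grad.to_dense(), dense_grad))
# The sparse gradient stores values only for the rows that were looked up.
print(sparse_grad.coalesce().values().shape, dense_grad.shape)
```

The values are the same either way. What changes is how many rows are materialised, which is where the difference shows up once the table has many more than 10 rows.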

Most of the embeddings are not updated in a given step, so sparse=True is probably the better choice here: the gradient stores only the rows that were used instead of a full tensor that is mostly zeros. If instead we were passing all of our input indices through the network, so that all of the embeddings were updated, we would set sparse=False.
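One thing to keep in mind on the cons side: as far as I know, only a few optimizers accept sparse gradients (optim.SGD, optim.SparseAdam and optim.Adagrad), so the choice of optimizer is constrained. A rough sketch of a single training step with plain SGD, assuming the same toy sizes as above (the variable names are mine):

```python
import torch
import torch.nn as nn

emb = nn.Embedding(10, 10, sparse=True)
# Plain SGD (no momentum, no weight decay) applies the sparse gradient directly.
opt = torch.optim.SGD(emb.parameters(), lr=0.1)

before = emb.weight.detach().clone()
idx = torch.LongTensor([8, 7])

opt.zero_grad()
loss = (2 * emb(idx)).sum()
loss.backward()
opt.step()

# Find which rows of the embedding table were actually modified by the step.
changed = (emb.weight.detach() != before).any(dim=1)
changed_rows = changed.nonzero().flatten().tolist()
print(changed_rows)
```

Only the rows that were looked up should show up in changed_rows; everything else in the table is left untouched by the step.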
