How is the gradient for torch.nn.Embedding calculated? The weight is simply a lookup table, so is the gradient propagated only for the indices that were actually looked up?
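To illustrate what I mean, here is a minimal sketch (the `.sum()` loss is just a stand-in for a real loss):

```python
import torch

emb = torch.nn.Embedding(4, 10)
x = torch.tensor([[0, 2]])

emb(x).sum().backward()

# Only the looked-up rows (0 and 2) have nonzero gradients;
# rows 1 and 3 are all zeros.
print(emb.weight.grad)
```

(With `sparse=True`, nn.Embedding even returns a sparse gradient holding only the looked-up rows.)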
I also have a side question, in case anyone knows anything about fine-tuning the BERT model: are the embedding layers' weights adjusted during fine-tuning? I assume they are, since the paper states:
… all of the parameters are fine-tuned using labeled data from the downstream tasks.
But wouldn’t changing the embedding parameters (for example the positional embeddings) distort the model? I am not sure.
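If the positional embeddings really should stay fixed, I suppose one could freeze them during fine-tuning; a minimal sketch, assuming the HuggingFace `transformers` API:

```python
from transformers import BertModel

model = BertModel.from_pretrained("bert-base-uncased")

# Freeze all embedding parameters (word, position, and token-type
# embeddings) so only the encoder layers are updated during fine-tuning.
for param in model.embeddings.parameters():
    param.requires_grad = False
```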
Thanks a lot! How would it handle duplicate indices? Is the gradient accumulated for duplicates in this case? For example, if emb.weight has the shape (4, 10) and x = tensor([[0, 0, 0, 1, 1, 1, 1]]) with the shape (1, 7).
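To make that concrete, a quick sketch (again using `.sum()` as a stand-in loss), where I would expect each occurrence of an index to contribute to the gradient of its row:

```python
import torch

emb = torch.nn.Embedding(4, 10)
x = torch.tensor([[0, 0, 0, 1, 1, 1, 1]])  # shape (1, 7)

emb(x).sum().backward()

# Row 0 appears 3 times and row 1 appears 4 times, so with a sum loss
# I would expect emb.weight.grad to hold 3s in row 0, 4s in row 1,
# and zeros in rows 2 and 3.
print(emb.weight.grad)
```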