How is the gradient for torch.nn.Embedding calculated? The weight is simply a lookup table, so is the gradient propagated only for the indices that were actually looked up?
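To illustrate what I mean, here is a minimal sketch (the `.sum()` loss is just a stand-in for a real loss):

```python
import torch

emb = torch.nn.Embedding(4, 10)
x = torch.tensor([[0, 2]])

emb(x).sum().backward()

# Only the looked-up rows (0 and 2) have nonzero gradients;
# rows 1 and 3 are all zeros.
print(emb.weight.grad)
```

(With `sparse=True`, nn.Embedding even returns a sparse gradient holding only the looked-up rows.)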
I also have a side question, in case anyone knows anything about fine-tuning the BERT model: are the embedding layers' weights adjusted during fine-tuning? I assume they are, since the paper states:
… all of the parameters are fine-tuned using labeled data from the downstream tasks.
But wouldn’t changing the embedding parameters (for example the positional embeddings) distort the model? I am not sure.
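If the positional embeddings really should stay fixed, I suppose one could freeze them during fine-tuning; a minimal sketch, assuming the HuggingFace `transformers` API:

```python
from transformers import BertModel

model = BertModel.from_pretrained("bert-base-uncased")

# Freeze all embedding parameters (word, position, and token-type
# embeddings) so only the encoder layers are updated during fine-tuning.
for param in model.embeddings.parameters():
    param.requires_grad = False
```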
Thanks a lot! How would it handle duplicate indices? Is the gradient accumulated for duplicates in this case? For example, if emb.weight has the shape (4, 10) and x = tensor([[0, 0, 0, 1, 1, 1, 1]]) with the shape (1, 7).
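To make that concrete, a quick sketch (again using `.sum()` as a stand-in loss), where I would expect each occurrence of an index to contribute to the gradient of its row:

```python
import torch

emb = torch.nn.Embedding(4, 10)
x = torch.tensor([[0, 0, 0, 1, 1, 1, 1]])  # shape (1, 7)

emb(x).sum().backward()

# Row 0 appears 3 times and row 1 appears 4 times, so with a sum loss
# I would expect emb.weight.grad to hold 3s in row 0, 4s in row 1,
# and zeros in rows 2 and 3.
print(emb.weight.grad)
```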