How does the gradient for the torch.nn.Embedding layer work?

Hi,

If I create an embedding layer with a very large vocab size, but my training data contains only a few tokens from the vocabulary, do the vector representations of the tokens that are not part of the training data also change?


No, the unused vectors do not receive any gradients from the backward pass and would thus not be updated, unless the weight tensor is included in a weight decay term or some other additive loss.
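
You can verify this with a quick check (a minimal sketch; the vocab size and token indices here are arbitrary toy values):

```python
import torch
import torch.nn as nn

# Vocabulary of 10 tokens, but the batch only uses tokens 2 and 5.
emb = nn.Embedding(num_embeddings=10, embedding_dim=4)
tokens = torch.tensor([2, 5, 5])

out = emb(tokens)
out.sum().backward()

# Only the rows for the indexed tokens receive a non-zero gradient.
print(emb.weight.grad[2])  # non-zero
print(emb.weight.grad[5])  # non-zero (token 5 appeared twice)
print(emb.weight.grad[0])  # all zeros; row 0 was never indexed
```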

Okay, so weight decay will affect the embeddings of all the tokens. I understand it now. Thanks for the reply!
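
For completeness, here is a sketch of that effect, assuming plain SGD with a weight_decay term (the hyperparameter values are illustrative only). The decay is applied to the entire weight tensor, so even a row that received a zero gradient from the backward pass still shrinks:

```python
import torch
import torch.nn as nn

emb = nn.Embedding(10, 4)
opt = torch.optim.SGD(emb.parameters(), lr=0.1, weight_decay=0.01)

# Snapshot a row whose token never appears in the batch.
unused_before = emb.weight[0].detach().clone()

out = emb(torch.tensor([2, 5]))
out.sum().backward()
opt.step()

# Row 0 got a zero gradient, yet weight decay still changed it slightly.
print(torch.allclose(emb.weight[0], unused_before))  # False
```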