How does the gradient for the torch.nn.Embedding layer work?


Suppose I keep an embedding layer with a very large vocab size, but my training data contains only a few tokens from the vocabulary. Do the vector representations of the tokens that never appear in the training data also change?

No, the unused vectors do not get any gradients from the backward pass and thus would not be updated, unless the weight tensor is also part of a weight decay term or some other additive loss term (e.g. a regularization penalty on the weights), which touches every row.
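Here is a minimal sketch to check this yourself. It assumes plain `torch.optim.SGD`, the default dense gradients, and a toy squared-sum loss; the vocab size, the token ids, and the `train_step` helper are made up for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

def train_step(weight_decay):
    # Hypothetical setup: 10-token vocab, but only tokens 1 and 3 appear in the batch.
    emb = nn.Embedding(num_embeddings=10, embedding_dim=4)
    before = emb.weight.detach().clone()
    opt = torch.optim.SGD(emb.parameters(), lr=0.1, weight_decay=weight_decay)

    tokens = torch.tensor([1, 3])
    loss = emb(tokens).pow(2).sum()  # toy loss, only depends on rows 1 and 3

    opt.zero_grad()
    loss.backward()
    grad = emb.weight.grad.clone()   # dense gradient, one row per vocab entry
    opt.step()

    # One boolean per vocab row: did any element of that row change?
    changed = (emb.weight.detach() != before).any(dim=1)
    return grad, changed

grad, changed_plain = train_step(weight_decay=0.0)
_, changed_decay = train_step(weight_decay=0.01)

unused = torch.tensor([0, 2, 4, 5, 6, 7, 8, 9])
print(grad[unused].abs().sum())  # gradient mass on the unused rows
print(changed_plain)             # without weight decay: only rows 1 and 3 move
print(changed_decay)             # with weight decay: every row moves
```

Without weight decay only the rows that were looked up change. With `weight_decay > 0` the optimizer adds a term proportional to the weight itself to every row's gradient, so the unused rows shrink towards zero as well.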

Okay, so weight decay will affect the embeddings of all the tokens. Understood, thanks for the reply.