Hi,
If I keep the embedding layer with a very large vocab size, but my training data contains only a few tokens from the vocabulary, do the vector representations of the tokens that are not part of the training data also change?
No, the unused vectors do not receive any gradients from the backward pass and would thus not be updated, unless you add the weight tensor to a weight decay term or another additive loss.
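Here is a minimal sketch illustrating both points with a plain `nn.Embedding` (the vocab size, embedding dim, and token ids are made up for the example):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
emb = nn.Embedding(num_embeddings=1000, embedding_dim=8)

# Pretend the training data only ever contains tokens 0-4
tokens = torch.tensor([0, 1, 2, 3, 4])
loss = emb(tokens).sum()
loss.backward()

# Rows for tokens that never appeared get exactly zero gradient
print(emb.weight.grad[5:].abs().sum())  # tensor(0.)
print(emb.weight.grad[:5].abs().sum())  # non-zero

# With weight decay, the optimizer update touches *every* row,
# including ones that received no gradient from the data
before = emb.weight.data[500].clone()
opt = torch.optim.SGD(emb.parameters(), lr=0.1, weight_decay=0.01)
opt.step()
print(torch.allclose(before, emb.weight.data[500]))  # False: row 500 shrank
```

With weight decay the optimizer pulls every row of the weight matrix toward zero regardless of whether it received a data gradient, which is why the unused embeddings drift too.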
Okay, so weight decay will affect the embeddings of all the tokens; I understand it now. Thanks for the reply!