I have a model in which an embedding layer (nn.Embedding) and a final nn.Linear projection layer share weights via weight tying.
Best practice seems to be to exclude embedding weights from weight decay while applying decay to linear-layer weights. What should I do in this situation, where the two layers are backed by the same tensor?
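For concreteness, here is a minimal sketch of the setup I mean (the model and layer names `TinyLM`, `embed`, and `proj` are mine, not from any library). Note that after tying, the shared tensor shows up only once in `named_parameters()`, so it can only land in one optimizer param group:

```python
import torch
import torch.nn as nn

class TinyLM(nn.Module):
    def __init__(self, vocab_size=100, d_model=32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.proj = nn.Linear(d_model, vocab_size, bias=False)
        # Weight tying: both modules now reference the same Parameter.
        self.proj.weight = self.embed.weight

model = TinyLM()

# named_parameters() deduplicates shared tensors, so the tied weight
# appears once (under the first-registered name, 'embed.weight' here).
names = [n for n, _ in model.named_parameters()]

# This forces a single choice for the shared weight: decay or no decay.
decay, no_decay = [], []
for n, p in model.named_parameters():
    (no_decay if "embed" in n else decay).append(p)

opt = torch.optim.AdamW([
    {"params": decay, "weight_decay": 0.01},
    {"params": no_decay, "weight_decay": 0.0},
])
```

As the sketch shows, I cannot put "the embedding copy" and "the linear copy" into different groups; they are one parameter, so whichever group it lands in decides for both roles.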
Here are the pages I have already checked without finding an answer:
- Weight decay in the optimizers is a bad idea (especially with BatchNorm)
- Weight decay exclusions by michaellavelle · Pull Request #24 · karpathy/minGPT · GitHub
- regularization - Why not perform weight decay on layernorm/embedding? - Cross Validated
- python - Tying weights in neural machine translation - Stack Overflow
- Weight decay only for weights of nn.Linear and nn.Conv*