Hi everyone,

I have a couple of questions about two features of nn.Embedding: **scale_grad_by_freq** and **max_norm**.

Firstly, I can’t find any references in the literature to either of these being used. Could someone point me to a common use case or a relevant academic paper?

I can see how it would be useful to down-weight stopwords during training by scaling the embedding gradients by inverse token frequency, measured over the entire corpus. From reading other posts on this forum, though, that is not what the layer does: it rescales gradients by the token frequency within the current mini-batch. Could anyone explain the purpose of rescaling by within-batch frequency? Has this method been used, or shown to be effective, in any recent deep-learning papers?
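To check that I'm reading the docs right, here is a small sketch of the behaviour I mean (the divide-by-batch-count semantics below are my understanding of the documentation, so please correct me if I've got it wrong):

```python
import torch
import torch.nn as nn

emb = nn.Embedding(10, 4, scale_grad_by_freq=True)

# Token 3 appears twice in this batch, token 5 once.
idx = torch.tensor([3, 3, 5])
emb(idx).sum().backward()

# With scale_grad_by_freq=True, the accumulated gradient for each row is
# divided by that token's count in the current mini-batch, so the two
# occurrences of token 3 end up contributing the same total gradient
# magnitude as the single occurrence of token 5.
print(emb.weight.grad[3])
print(emb.weight.grad[5])
```

So within a batch frequent tokens are effectively averaged rather than summed over, but the scaling says nothing about corpus-level frequency, which is what I'd have expected for stopword down-weighting.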

One other question: in the ‘Attention Is All You Need’ paper, the token embedding output is scaled up by sqrt(embedding dim) before the positional embeddings are added. In practice this usually means multiplying the word vectors by a factor of ~20. I imagine this policy is prone to error, since the relative magnitudes of the token and position embeddings can depend on a variety of factors (choice of pretrained vectors, whether the positional vectors are sinusoidal or learned, initialization choice when training from scratch). Would it be feasible to reduce the risk of scaling errors by simply using the max_norm argument to fix the scales of both tokens and positions prior to summing them (thereby dropping the dimensional scale factor altogether)?
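To illustrate the alternative I have in mind (the sizes and the shared norm of 1.0 are just placeholders, not a tuned choice):

```python
import torch
import torch.nn as nn

d_model, vocab_size, max_len = 16, 100, 32

# Clamp both tables to a shared norm via max_norm instead of
# multiplying the token embeddings by sqrt(d_model).
tok = nn.Embedding(vocab_size, d_model, max_norm=1.0)
pos = nn.Embedding(max_len, d_model, max_norm=1.0)

ids = torch.randint(0, vocab_size, (4, 8))
positions = torch.arange(8).expand(4, 8)

# max_norm renormalizes (in place, at lookup time) any embedding row
# whose norm exceeds the limit, so both summands have norm <= 1.0.
# Note it only caps norms; rows already below the limit are untouched.
x = tok(ids) + pos(positions)
```

My thinking is that this would guarantee the two summands live on comparable scales regardless of how either table was initialized, which is the property the sqrt(d_model) factor seems intended to restore.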

Thanks a lot for any help