PyTorch Transformers

I wrote a post for a similar question here,

see if it helps. It is about math.sqrt(self.ninp).
In the paper (https://arxiv.org/pdf/1706.03762.pdf) they multiply the embedding weights by a scaling factor of sqrt(d_model), which corresponds to math.sqrt(self.ninp) in the tutorial. Maybe removing this scaling factor (that is, not multiplying by math.sqrt(self.ninp)) gives better accuracy in the tutorial.
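To make the idea concrete, here is a minimal sketch of turning the scaling factor on and off. It is plain Python so it runs without PyTorch installed, and it is not the tutorial's code: the class name ScaledEmbedding, the scale flag, and the uniform(-0.1, 0.1) initialization are my own assumptions for illustration. In the tutorial, the same switch would go around the line that multiplies the embedding output by math.sqrt(self.ninp).

```python
import math
import random


class ScaledEmbedding:
    """Toy embedding lookup with an optional sqrt(ninp) scaling factor.

    Hypothetical helper for illustration only, not part of PyTorch.
    """

    def __init__(self, ntoken, ninp, scale=True, seed=0):
        rng = random.Random(seed)
        self.ninp = ninp
        self.scale = scale
        # One row of ninp weights per token; the small uniform range
        # is an assumption, chosen only to keep the numbers readable.
        self.weight = [
            [rng.uniform(-0.1, 0.1) for _ in range(ninp)]
            for _ in range(ntoken)
        ]

    def __call__(self, token_ids):
        # With scale=True, multiply by sqrt(ninp) as the paper describes;
        # with scale=False, return the raw embedding rows unchanged.
        factor = math.sqrt(self.ninp) if self.scale else 1.0
        return [[w * factor for w in self.weight[t]] for t in token_ids]


# Same seed, so both objects hold identical weights.
scaled = ScaledEmbedding(ntoken=10, ninp=16, scale=True)
unscaled = ScaledEmbedding(ntoken=10, ninp=16, scale=False)

out_scaled = scaled([3, 7])
out_unscaled = unscaled([3, 7])

# Ratio between the two outputs for the same weight: sqrt(16) = 4
ratio = out_scaled[0][0] / out_unscaled[0][0]
print(ratio)
```

With a flag like this you can train once with scale=True and once with scale=False and compare the validation loss, instead of guessing which one works better.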