I don’t understand why there is a mask in the transformer encoder.
In the decoder we don’t want the model to look at future positions, so masking makes sense there, but why is it needed in the encoder?
Also, what is the purpose of multiplying by math.sqrt(self.ninp) in the tutorial code?
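Roughly, the part of the tutorial I mean looks like this (paraphrased into a self-contained snippet; the sizes are just placeholders and I’ve left out the positional encoding):

```python
import math
import torch
import torch.nn as nn

ntoken, ninp, nhead, nhid, nlayers = 1000, 64, 4, 128, 2

embedding = nn.Embedding(ntoken, ninp)
encoder_layer = nn.TransformerEncoderLayer(ninp, nhead, nhid)
transformer_encoder = nn.TransformerEncoder(encoder_layer, nlayers)

src = torch.randint(0, ntoken, (35, 8))   # (seq_len, batch) of token ids
seq_len = src.size(0)

# "Square subsequent" (causal) mask: -inf above the diagonal, 0 elsewhere,
# so position i can only attend to positions <= i.
src_mask = torch.triu(torch.full((seq_len, seq_len), float('-inf')), diagonal=1)

x = embedding(src) * math.sqrt(ninp)       # the scaling I'm asking about
out = transformer_encoder(x, src_mask)     # the mask I'm asking about
print(out.shape)                           # torch.Size([35, 8, 64])
```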
Regarding math.sqrt(self.ninp), see if this helps:
In the paper (https://arxiv.org/pdf/1706.03762.pdf) they use a scaling factor of math.sqrt(self.ninp); maybe cancelling this scaling factor (i.e. not dividing by math.sqrt(self.ninp)) gives better accuracy in the tutorial.
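For reference, here is a minimal sketch of the scaled dot-product attention from the paper, where the raw scores are divided by sqrt(d_k); this is just an illustration, not the tutorial’s code:

```python
import math
import torch

def scaled_dot_product_attention(q, k, v, mask=None):
    # "Attention Is All You Need": softmax(Q K^T / sqrt(d_k)) V
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)   # division happens inside attention
    if mask is not None:
        scores = scores + mask                           # additive mask (-inf on blocked positions)
    return torch.softmax(scores, dim=-1) @ v

q = k = v = torch.randn(2, 5, 16)                        # (batch, seq, d_k)
print(scaled_dot_product_attention(q, k, v).shape)       # torch.Size([2, 5, 16])
```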
Well, actually my question is different from the one you answered.
My question is: what is the need for the mask in the encoder layer?
I think it makes sense when you want to use a GPT-2-like architecture, I guess. Correct me if I’m wrong.
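For example, here is a toy check of that intuition: with the causal mask the encoder behaves like a decoder-only (GPT-2-style) model, and changing a future token does not change the outputs at earlier positions. The sizes and seed are arbitrary:

```python
import math
import torch
import torch.nn as nn

torch.manual_seed(0)
ninp, nhead, ntoken, seq_len = 32, 4, 100, 6

emb = nn.Embedding(ntoken, ninp)
enc = nn.TransformerEncoder(nn.TransformerEncoderLayer(ninp, nhead, 64), 2)
enc.eval()  # disable dropout so outputs are deterministic

causal = torch.triu(torch.full((seq_len, seq_len), float('-inf')), diagonal=1)

a = torch.randint(0, ntoken, (seq_len, 1))   # (seq_len, batch=1)
b = a.clone()
b[-1] = (b[-1] + 1) % ntoken                 # change only the last ("future") token

with torch.no_grad():
    out_a = enc(emb(a) * math.sqrt(ninp), causal)
    out_b = enc(emb(b) * math.sqrt(ninp), causal)

# With the causal mask, outputs at earlier positions are unaffected
# by the change at the last position.
print(torch.allclose(out_a[:-1], out_b[:-1], atol=1e-6))  # True
```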
Regarding the scaling factor: even if we don’t scale, it shouldn’t affect the results much. I can’t recall the paper, but earlier versions of dot-product attention didn’t use a scale factor at all; obviously, though, scaling helps.
But I don’t understand why we are multiplying by it. Based on the attention paper, shouldn’t we be dividing?
The reason we increase the embedding values [by multiplying them by math.sqrt(self.d_model)] before the addition is to make the positional encoding relatively smaller. This means the original meaning in the embedding vector won’t be lost when we add them together.
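A quick numeric way to see this (a toy check with arbitrary sizes): with d_model = 512, a raw nn.Embedding row has norm around sqrt(512) ≈ 22.6, comparable to the sinusoidal positional encoding’s norm of about 16; after multiplying by sqrt(d_model), the embedding norm is around 512, so the positional encoding becomes a relatively small addition.

```python
import math
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model, seq_len = 512, 50

emb = nn.Embedding(1000, d_model)                 # rows initialized ~ N(0, 1)
x = emb(torch.randint(0, 1000, (seq_len,)))       # raw token embeddings

# Sinusoidal positional encoding as in the paper
pos = torch.arange(seq_len, dtype=torch.float).unsqueeze(1)
div = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float) * (-math.log(10000.0) / d_model))
pe = torch.zeros(seq_len, d_model)
pe[:, 0::2] = torch.sin(pos * div)
pe[:, 1::2] = torch.cos(pos * div)

print(x.norm(dim=-1).mean().item())                          # ~22.6 (= sqrt(512))
print(pe.norm(dim=-1).mean().item())                         # 16.0  (= sqrt(512 / 2))
print((x * math.sqrt(d_model)).norm(dim=-1).mean().item())   # ~512, dwarfs the positional encoding
```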