PyTorch Transformers

The code below is from the PyTorch docs:

    def forward(self, src):
        # Rebuild the attention mask if it is missing or its size no longer
        # matches the current input length.
        if self.src_mask is None or self.src_mask.size(0) != len(src):
            device = src.device
            mask = self._generate_square_subsequent_mask(len(src)).to(device)
            self.src_mask = mask

        # Embed the tokens (self.encoder is the embedding layer here), scale by
        # sqrt(ninp), and add positional encodings.
        src = self.encoder(src) * math.sqrt(self.ninp)
        src = self.pos_encoder(src)
        # Run the masked transformer encoder, then project back to the vocabulary
        # (self.decoder is a final linear layer in the tutorial).
        output = self.transformer_encoder(src, self.src_mask)
        output = self.decoder(output)
        return output
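
For reference, _generate_square_subsequent_mask builds a causal (“subsequent”) mask. A minimal standalone sketch (not the tutorial’s exact helper) looks like this:

    import torch

    def generate_square_subsequent_mask(sz):
        # Positions a token may attend to get 0, future positions get -inf,
        # so the softmax inside attention assigns them zero weight.
        return torch.triu(torch.full((sz, sz), float('-inf')), diagonal=1)

    print(generate_square_subsequent_mask(4))
    # tensor([[0., -inf, -inf, -inf],
    #         [0., 0., -inf, -inf],
    #         [0., 0., 0., -inf],
    #         [0., 0., 0., 0.]])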

I don’t understand why there is a mask in the transformer encoder.
In the decoder it makes sense to mask, because we don’t want it to look at future positions, but why is there one in the encoder?

And what is the purpose of math.sqrt(self.ninp) in the line below?

src = self.encoder(src) * math.sqrt(self.ninp)

Thanks in advance

I created a post for a similar question here; see if it helps regarding math.sqrt(self.ninp).

In the paper (https://arxiv.org/pdf/1706.03762.pdf) they use a scaling factor of math.sqrt(self.ninp); maybe cancelling this scaling factor (or not dividing by math.sqrt(self.ninp)) gives better accuracy in the tutorial.

Well, actually my question is different from the one you asked: what’s the need for the mask in the encoder layer?
I think it makes sense when you want an architecture like GPT-2, where the encoder stack is used as a causal language model, I guess. Correct me if I’m wrong.
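
For example, I would guess something like this toy sketch (made-up sizes, not the tutorial code) is the idea: passing a square subsequent mask to nn.TransformerEncoder makes each position attend only to itself and earlier positions, GPT-style.

    import torch
    import torch.nn as nn

    d_model, nhead, seq_len = 16, 4, 5              # toy sizes, just for illustration
    layer = nn.TransformerEncoderLayer(d_model, nhead)
    encoder = nn.TransformerEncoder(layer, num_layers=2)

    src = torch.randn(seq_len, 1, d_model)          # (seq_len, batch, d_model)
    causal_mask = torch.triu(torch.full((seq_len, seq_len), float('-inf')), diagonal=1)

    out = encoder(src, mask=causal_mask)            # no position can attend to a later one
    print(out.shape)                                # torch.Size([5, 1, 16])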

Regarding the scaling factor: even if we don’t scale, it shouldn’t affect the results much. I can’t recall the paper, but earlier versions of dot-product attention didn’t use a scale factor at all; obviously, though, scaling helps.

What I don’t understand is why we are multiplying here. Based on the attention paper it should be a division.
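
To be concrete, the division I mean is the one inside attention itself. A rough sketch of scaled dot-product attention as described in the paper (the multiplication by math.sqrt(self.ninp) in the tutorial happens earlier, on the embedding output):

    import math
    import torch

    def scaled_dot_product_attention(q, k, v):
        # The paper divides the dot products by sqrt(d_k) *inside* attention;
        # this is separate from the sqrt(d_model) multiplication applied to the
        # embeddings before they reach the attention layers.
        d_k = q.size(-1)
        scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)
        weights = torch.softmax(scores, dim=-1)
        return weights @ v

    q = k = v = torch.randn(2, 5, 64)                    # (batch, seq_len, d_k), toy sizes
    print(scaled_dot_product_attention(q, k, v).shape)   # torch.Size([2, 5, 64])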

Again, regarding the scaling factor:

The paper does say:

In the embedding layers, we multiply those weights by sqrt(d_model).

(Note: The Annotated Transformer connects the paper to PyTorch code line by line.)

However, the paper does not say why this is done.

This post says the following:

The reason we increase the embedding values [by multiplying by math.sqrt(self.d_model)] before addition is to make the positional encoding relatively smaller. This means the original meaning in the embedding vector won’t be lost when we add them together.
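
To see what “relatively smaller” means in practice, here is a rough comparison of per-position norms (just a sketch; the exact numbers depend on how the embedding weights are initialized, and nn.Embedding defaults to N(0, 1)):

    import math
    import torch
    import torch.nn as nn

    d_model, vocab, seq_len = 512, 1000, 10
    emb = nn.Embedding(vocab, d_model)                  # weights drawn from N(0, 1)
    tokens = torch.randint(0, vocab, (seq_len,))

    e = emb(tokens)                                     # raw embeddings
    e_scaled = e * math.sqrt(d_model)                   # what the tutorial feeds to pos_encoder

    # Sinusoidal positional encoding: each sin/cos pair contributes sin^2 + cos^2 = 1,
    # so every position has norm exactly sqrt(d_model / 2) = 16 here.
    pos = torch.arange(seq_len).unsqueeze(1)
    div = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)

    print(e.norm(dim=-1).mean())         # ~sqrt(d_model) ≈ 22.6
    print(e_scaled.norm(dim=-1).mean())  # ~d_model = 512
    print(pe.norm(dim=-1).mean())        # exactly 16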