Why does the transformer tutorial have a multiplication by square root of the number of inputs?

Brando_Miranda · July 14, 2021, 2:41pm

Why does the transformer tutorial multiply by the square root of the number of inputs? I know there is a division by sqrt(D) in the multi-headed self-attention, but why is there a similar factor applied to the output of the encoder?

In particular:

src = self.encoder(src) * math.sqrt(self.ninp)

from: Language Modeling with nn.Transformer and TorchText — PyTorch Tutorials 1.9.0+cu102 documentation

Reference code:

import math

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.nn import TransformerEncoder, TransformerEncoderLayer

class TransformerModel(nn.Module):

    def __init__(self, ntoken, ninp, nhead, nhid, nlayers, dropout=0.5):
        super(TransformerModel, self).__init__()
        self.model_type = 'Transformer'
        # PositionalEncoding is defined elsewhere in the tutorial (not shown here)
        self.pos_encoder = PositionalEncoding(ninp, dropout)
        encoder_layers = TransformerEncoderLayer(ninp, nhead, nhid, dropout)
        self.transformer_encoder = TransformerEncoder(encoder_layers, nlayers)
        self.encoder = nn.Embedding(ntoken, ninp)
        self.ninp = ninp
        self.decoder = nn.Linear(ninp, ntoken)

        self.init_weights()

    def generate_square_subsequent_mask(self, sz):
        mask = (torch.triu(torch.ones(sz, sz)) == 1).transpose(0, 1)
        mask = mask.float().masked_fill(mask == 0, float('-inf')).masked_fill(mask == 1, float(0.0))
        return mask

    def init_weights(self):
        initrange = 0.1
        self.encoder.weight.data.uniform_(-initrange, initrange)
        self.decoder.bias.data.zero_()
        self.decoder.weight.data.uniform_(-initrange, initrange)

    def forward(self, src, src_mask):
        src = self.encoder(src) * math.sqrt(self.ninp)
        src = self.pos_encoder(src)
        output = self.transformer_encoder(src, src_mask)
        output = self.decoder(output)
        return output

or

self.embedding(tokens.long()) * math.sqrt(self.emb_size)

AlphaBetaGamma96 · July 14, 2021, 2:48pm

I’m not an expert in NLP (so double check with an expert), but I believe the square root is there to normalize the encoding with respect to the input size, so the embedding isn’t dependent on the size of the input. As you’re taking a dot product, the magnitude of the dot product will scale with the square root of the dimension size, and by dividing by that value you normalize it!
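
To make the dot-product claim concrete, here is a small sketch using only the standard library. It assumes the entries of q and k are independent with zero mean and unit variance (an assumption for illustration, not something the tutorial states), and the helper name `dot_product_std` is made up. Note that this illustrates the division by sqrt(d) inside attention, not the multiplication applied to the embedding.

```python
import math
import random
import statistics

def dot_product_std(dim, trials=2000, seed=0):
    """Empirical standard deviation of q.k for random unit-variance vectors."""
    rng = random.Random(seed)
    dots = []
    for _ in range(trials):
        q = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        k = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        dots.append(sum(a * b for a, b in zip(q, k)))
    return statistics.pstdev(dots)

for dim in (16, 64, 256):
    raw = dot_product_std(dim)
    # The last column should stay close to 1 if the spread grows like sqrt(dim).
    print(f"dim={dim:4d}  std(q.k)={raw:7.2f}  sqrt(dim)={math.sqrt(dim):6.2f}  "
          f"ratio={raw / math.sqrt(dim):.2f}")
```

Under these assumptions the ratio column stays roughly constant, which is the usual motivation for dividing the attention scores by sqrt(d).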

Brando_Miranda · July 14, 2021, 2:51pm

hmmm…my understanding is that there are two of these going on.

One inside the MHA:

embed = softmax(Q K^T / d**0.5) V

but the one here is the following:

embed = encoder(x) * T_x**0.5

which seems different. If your argument holds, I would have expected a division by T_x**0.5, not a multiplication… though your comment gave me something to think about…

AlphaBetaGamma96 · July 14, 2021, 4:02pm

I’m not too sure. What exactly is T_x? If I had to guess, I’d say maybe you normalize out the dimension within the self-attention mechanism but add it back in after the calculation is done to preserve the dimensionality of the input?

Brando_Miranda · July 14, 2021, 8:28pm

T_x = sequence length of the input. Usually the sequence length is denoted with a t, e.g. x^<t>, and the subscript x emphasizes that it’s the input.
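
For what it's worth, in the reference code quoted earlier in this thread, `ninp` is the second argument to `nn.Embedding`, i.e. the embedding width, so the factor `math.sqrt(self.ninp)` does not appear to depend on the sequence length T_x. A minimal sketch to check this (the sizes below are made up for illustration; it requires PyTorch):

```python
import math
import torch
import torch.nn as nn

ntoken, ninp = 100, 16   # hypothetical vocabulary size and embedding width
encoder = nn.Embedding(ntoken, ninp)

for seq_len in (5, 50):
    src = torch.randint(0, ntoken, (seq_len, 2))   # (sequence length, batch size)
    emb = encoder(src)                             # (sequence length, batch size, ninp)
    scaled = emb * math.sqrt(ninp)                 # same factor for every seq_len
    print(seq_len, tuple(scaled.shape), math.sqrt(ninp))
```

The last dimension of the output is `ninp` regardless of how long the input sequence is, so the scale factor is sqrt(embedding dimension) rather than sqrt(T_x).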

Brando_Miranda · July 14, 2021, 8:43pm

On a related note it seems that the other tutorial Language Translation with nn.Transformer and torchtext — PyTorch Tutorials 1.9.0+cu102 documentation also has something like that (but perhaps for a different reason):

import math

import torch.nn as nn
from torch import Tensor

# helper Module to convert tensor of input indices into corresponding tensor of token embeddings
class TokenEmbedding(nn.Module):
    def __init__(self, vocab_size: int, emb_size):
        super(TokenEmbedding, self).__init__()
        self.embedding = nn.Embedding(vocab_size, emb_size)
        self.emb_size = emb_size

    def forward(self, tokens: Tensor):
        # scale the embeddings by sqrt(emb_size) before they are combined with positional encodings
        return self.embedding(tokens.long()) * math.sqrt(self.emb_size)

Brando_Miranda · July 20, 2021, 4:04pm

@albanD apologies for the direct tag. Do you know who to tag that might know the answer to this?

edit2:
@ptrblck apologies for the direct tag. Do you know who to tag that might know the answer to this?

111550 · May 16, 2022, 9:38am

You might not need this anymore, but the answer is:
https://datascience.stackexchange.com/questions/87906/transformer-model-why-are-word-embeddings-scaled-before-adding-positional-encod/87909#87909
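
One explanation commonly given for this scaling (and, as far as I can tell, the one in the linked answer) is that multiplying the embeddings by sqrt(d_model) makes the positional encoding relatively smaller, so the token information is not drowned out when the two are added. The sketch below compares vector norms under some assumptions: a sinusoidal positional encoding as in "Attention Is All You Need", an embedding row drawn from uniform(-0.1, 0.1) like `init_weights` above, and d_model = 512 chosen for illustration. The helper names are made up.

```python
import math
import random

def positional_encoding(pos, d_model):
    """Sinusoidal encoding for one position (d_model assumed even)."""
    pe = []
    for i in range(0, d_model, 2):
        angle = pos / (10000 ** (i / d_model))
        pe.extend([math.sin(angle), math.cos(angle)])
    return pe

def norm(vec):
    """Euclidean length of a vector given as a list of floats."""
    return math.sqrt(sum(v * v for v in vec))

d_model = 512
rng = random.Random(0)
# One embedding row, initialised like init_weights: uniform(-0.1, 0.1).
embedding = [rng.uniform(-0.1, 0.1) for _ in range(d_model)]
scaled = [v * math.sqrt(d_model) for v in embedding]
pe = positional_encoding(pos=10, d_model=d_model)

print(f"|embedding|           = {norm(embedding):.2f}")
print(f"|embedding * sqrt(d)| = {norm(scaled):.2f}")
print(f"|positional encoding| = {norm(pe):.2f}")
```

Each sin/cos pair contributes exactly 1 to the squared norm, so the positional encoding has norm sqrt(d_model / 2) at every position. With this initialisation the unscaled embedding is much shorter than that, while the scaled one is longer. This only describes the situation at initialisation; trained embeddings may behave differently.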