Why does the transformer tutorial have a multiplication by square root of the number of inputs?

Brando_Miranda · July 14, 2021, 8:43pm

On a related note it seems that the other tutorial Language Translation with nn.Transformer and torchtext — PyTorch Tutorials 1.9.0+cu102 documentation also has something like that (but perhaps for a different reason):

# helper Module to convert tensor of input indices into corresponding tensor of token embeddings
class TokenEmbedding(nn.Module):
    def __init__(self, vocab_size: int, emb_size):
        super(TokenEmbedding, self).__init__()
        self.embedding = nn.Embedding(vocab_size, emb_size)
        self.emb_size = emb_size

    def forward(self, tokens: Tensor):
        return self.embedding(tokens.long()) * math.sqrt(self.emb_size)