# Why does the transformer tutorial have a multiplication by square root of the number of inputs?

Why does the transformer tutorial multiply by the square root of the number of inputs (ninp)? I know there is a division by sqrt(D) inside multi-head self-attention, but why is there also one applied to the output of the encoder (the embedding layer)?

In particular:

src = self.encoder(src) * math.sqrt(self.ninp)


Reference code:

import math

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.nn import TransformerEncoder, TransformerEncoderLayer

class TransformerModel(nn.Module):

    def __init__(self, ntoken, ninp, nhead, nhid, nlayers, dropout=0.5):
        super(TransformerModel, self).__init__()
        self.model_type = 'Transformer'
        self.pos_encoder = PositionalEncoding(ninp, dropout)  # PositionalEncoding is defined elsewhere in the tutorial
        encoder_layers = TransformerEncoderLayer(ninp, nhead, nhid, dropout)
        self.transformer_encoder = TransformerEncoder(encoder_layers, nlayers)
        self.encoder = nn.Embedding(ntoken, ninp)  # token embedding, ninp = embedding dimension
        self.ninp = ninp
        self.decoder = nn.Linear(ninp, ntoken)     # projects back to vocabulary logits

        self.init_weights()

    def generate_square_subsequent_mask(self, sz):
        # causal mask: each position may only attend to earlier positions
        mask = (torch.triu(torch.ones(sz, sz)) == 1).transpose(0, 1)
        mask = mask.float().masked_fill(mask == 0, float('-inf')).masked_fill(mask == 1, float(0.0))
        return mask

    def init_weights(self):
        initrange = 0.1
        self.encoder.weight.data.uniform_(-initrange, initrange)
        self.decoder.bias.data.zero_()
        self.decoder.weight.data.uniform_(-initrange, initrange)

    def forward(self, src, src_mask):
        src = self.encoder(src) * math.sqrt(self.ninp)  # <-- the scaling in question
        src = self.pos_encoder(src)
        output = self.transformer_encoder(src, src_mask)
        output = self.decoder(output)
        return output


or

self.embedding(tokens.long()) * math.sqrt(self.emb_size)
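
Just to show what that line does numerically, here is a toy check (sizes are made up, weights initialized uniform(-0.1, 0.1) like init_weights above):

import math
import torch
import torch.nn as nn

ntoken, ninp = 1000, 256
encoder = nn.Embedding(ntoken, ninp)
encoder.weight.data.uniform_(-0.1, 0.1)           # same init as the tutorial's init_weights

src = torch.randint(0, ntoken, (35, 8))           # (seq_len, batch) of token indices
emb = encoder(src)
scaled = emb * math.sqrt(ninp)

print(emb.std().item(), scaled.std().item())      # the scaling blows the values up by sqrt(ninp) = 16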


I’m not an expert in NLP (so double check with an expert), but I believe the square root is there to normalize the encoding with respect to the input size, so the embedding isn’t dependent on the size of the input. When you take a dot product, its magnitude will scale with the square root of the dimension, and by dividing by that value you normalize it!
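
A rough way to sanity check that claim (just a toy experiment, not anything from the tutorial): for random vectors with unit-variance entries, the standard deviation of their dot product grows like sqrt(D):

import math
import torch

for d in (16, 64, 256):
    q = torch.randn(10000, d)
    k = torch.randn(10000, d)
    dots = (q * k).sum(dim=-1)                    # dot product per row
    print(d, round(dots.std().item(), 1), round(math.sqrt(d), 1))  # std is close to sqrt(d)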


hmmm… my understanding is that there are two of these going on:

1. one inside the MHA:

embed = softmax(Q K^T / d**0.5) V

2. but the one here is the following:

embed = encoder(x) * T_x**0.5

which seems different. If your argument holds, if anything I would have expected to divide by T_x**0.5, not multiply by it… though your comment gave me something to think about…
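
To put the two side by side, here is a toy sketch of what I mean (made-up sizes, not the tutorial's code):

import math
import torch
import torch.nn as nn
import torch.nn.functional as F

ninp, seq_len, ntoken = 16, 5, 100
encoder = nn.Embedding(ntoken, ninp)
x = torch.randint(0, ntoken, (seq_len,))

# (1) inside multi-head attention: the scores are *divided* by sqrt(d)
q = k = v = torch.randn(seq_len, ninp)
attn = F.softmax(q @ k.T / math.sqrt(ninp), dim=-1) @ v

# (2) after the embedding: the output is *multiplied* by sqrt(self.ninp)
src = encoder(x) * math.sqrt(ninp)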

I’m not too sure. What exactly is T_x? If I had to guess, I’d say maybe you normalize out the dimension within the self-attention mechanism but add it back in after the calculation is done to preserve the dimensionality of the input?

T_x = the sequence length of the input. Usually the sequence length is denoted with a t, e.g. x^<t>, and the subscript x emphasizes that it’s the input.

On a related note, it seems that the other tutorial, Language Translation with nn.Transformer and torchtext — PyTorch Tutorials 1.9.0+cu102 documentation, also has something like that (but perhaps for a different reason):

from torch import Tensor

# helper Module to convert tensor of input indices into corresponding tensor of token embeddings
class TokenEmbedding(nn.Module):
    def __init__(self, vocab_size: int, emb_size):
        super(TokenEmbedding, self).__init__()
        self.embedding = nn.Embedding(vocab_size, emb_size)
        self.emb_size = emb_size

    def forward(self, tokens: Tensor):
        return self.embedding(tokens.long()) * math.sqrt(self.emb_size)
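
In case it helps, this is roughly how that helper would be used (vocab size, emb_size and shapes are made up; it assumes the imports and class above):

tok_emb = TokenEmbedding(vocab_size=1000, emb_size=512)
tokens = torch.randint(0, 1000, (7, 2))   # (seq_len, batch) of token indices
out = tok_emb(tokens)                     # (7, 2, 512), already scaled by sqrt(512)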

@albanD apologies for the direct tag. Do you know who to tag that might know the answer to this?

edit2:
@ptrblck apologies for the direct tag. Do you know who to tag that might know the answer to this?


You might not need this anymore, but the answer is: