Hi all, I was recently reading the BERT source code from the Hugging Face project. I noticed that the so-called "learnable position encoding" seems, in terms of implementation, to come down to a specific nn.Parameter layer.
class MyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.positional_encoding = nn.Parameter(torch.zeros(512, 768))

    def forward(self, x):
        return x + self.positional_encoding
↑ Something like this, I assume, is what implements the learnable position encoding. Is it really that simple? I'm not sure I understand it correctly, so I'd like to ask someone with experience.
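To make my question concrete, here is a minimal runnable sketch of what I mean by "learnable position encoding": one trainable vector per position, added to the input and updated by backprop like any other weight. The class name and shapes are my own choices for illustration, not the actual Hugging Face code (which, as far as I can tell, uses an nn.Embedding(512, 768) indexed by position ids, which should be equivalent).

```python
import torch
import torch.nn as nn

class LearnedPositionalEncoding(nn.Module):
    """Minimal sketch (my understanding, not the exact HF implementation)."""

    def __init__(self, max_len=512, d_model=768):
        super().__init__()
        # One trainable vector per position; a plain nn.Parameter,
        # optimized together with the rest of the model.
        self.pos = nn.Parameter(torch.zeros(max_len, d_model))

    def forward(self, x):  # x: (batch, seq_len, d_model)
        seq_len = x.size(1)
        return x + self.pos[:seq_len]

x = torch.randn(2, 10, 768)
enc = LearnedPositionalEncoding()
out = enc(x)
```

If that is all there is to it, then the "learning" is just gradient descent on this table of position vectors.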
In addition, I noticed that in the classic BERT architecture, position encoding is actually applied only once, at the initial input. Does this mean that the subsequent BERT layers lose the ability to capture positional information?
BertModel(
  (embeddings): BertEmbeddings(
    (word_embeddings): Embedding(30522, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (token_type_embeddings): Embedding(2, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): BertEncoder(
    (layer): ModuleList(
      (0): BertLayer(...)
      ...
    )
  )
  (pooler): BertPooler(...)
)
Would I get better results if I re-applied positional encoding to the output of each layer before feeding it into the next BERT layer?
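Concretely, the experiment I have in mind looks something like the sketch below: a shared learnable position table re-added before every encoder layer instead of only once at the input. All names here (ReinjectedEncoder, the toy dimensions) are mine, and I use PyTorch's stock nn.TransformerEncoderLayer as a stand-in for a BertLayer just to keep the sketch self-contained.

```python
import torch
import torch.nn as nn

class ReinjectedEncoder(nn.Module):
    """Hypothetical variant: re-inject position encoding at every layer."""

    def __init__(self, num_layers=4, max_len=512, d_model=64):
        super().__init__()
        self.pos = nn.Parameter(torch.zeros(max_len, d_model))
        # Stand-in for BertLayer, just for a runnable sketch.
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
            for _ in range(num_layers)
        )

    def forward(self, x):  # x: (batch, seq_len, d_model)
        for layer in self.layers:
            # Re-add the position table before each layer, not just the first.
            x = layer(x + self.pos[: x.size(1)])
        return x
```

Is there any known result on whether this kind of per-layer re-injection helps, or do the residual connections already carry enough positional signal through the stack?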