What is the essence of a learnable positional embedding? Does it actually improve results?

Hi all, I was recently reading the BERT source code in the Hugging Face project. I noticed that the so-called "learnable position encoding" seems to boil down, in implementation, to a specific nn.Parameter that gets added to the input.

def __init__(self):
    self.positional_encoding = nn.Parameter(torch.zeros(max_len, d_model))

def forward(self, x):
    return x + self.positional_encoding

↑ Is it really as simple as that: you add a learned tensor to the input, and that counts as learnable position encoding? I'm not sure I understand it correctly, so I'd like to hear from someone with experience.
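For what it's worth, here is what I imagine a complete version would look like (a minimal sketch with names I made up, not the actual Hugging Face code):

```python
import torch
import torch.nn as nn

class LearnablePositionalEmbedding(nn.Module):
    """Adds one trainable vector per position to the input embeddings."""

    def __init__(self, max_len=512, d_model=768):
        super().__init__()
        # Plain trainable floats, one row per position.
        self.positional_encoding = nn.Parameter(torch.zeros(1, max_len, d_model))

    def forward(self, x):
        # x: (batch, seq_len, d_model); slice the table to the actual length.
        return x + self.positional_encoding[:, : x.size(1)]
```

If that is all the "learnable" part amounts to, then the optimizer just updates those floats like any other weight.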

In addition, I noticed that in the classic BERT architecture, position is actually encoded only once, at the initial input. Does that mean the subsequent BERT layers lose the ability to capture positional information?

  (embeddings): BertEmbeddings(
    (word_embeddings): Embedding(30522, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (token_type_embeddings): Embedding(2, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  (encoder): BertEncoder(
    (layer): ModuleList(
      (0): BertLayer(...)
  (pooler): BertPooler(...)

Would I get better results if I re-applied positional encoding to the output of each layer before feeding it into the next BERT layer?


Well, I don't know the specifics for BERT. The positional encoding is information you inject once at the input. From there the residual connections carry it forward, and subsequent layers can use that information however works best. So yes, subsequent layers are aware of position.
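One way to convince yourself of this: self-attention with no positional signal is permutation-equivariant, so by itself it cannot see order. Adding the positional signal once at the input already breaks that symmetry for every layer downstream. A toy check in plain PyTorch (nothing BERT-specific; the tensors and names are just for illustration):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
attn = nn.MultiheadAttention(embed_dim=16, num_heads=2, batch_first=True)
x = torch.randn(1, 4, 16)
perm = torch.tensor([2, 0, 3, 1])  # an arbitrary reordering of the 4 tokens

# Without positions, permuting the tokens just permutes the output:
# the layer has no way to tell that the order changed.
y, _ = attn(x, x, x)
y_shuffled, _ = attn(x[:, perm], x[:, perm], x[:, perm])
assert torch.allclose(y[:, perm], y_shuffled, atol=1e-5)

# Add a positional signal once, at the input. The shuffled sequence is
# no longer just a permutation of the original, so the outputs differ:
# order information survives into this (and every later) layer.
pos = torch.randn(1, 4, 16)
z, _ = attn(x + pos, x + pos, x + pos)
shuffled_in = x[:, perm] + pos
z_shuffled, _ = attn(shuffled_in, shuffled_in, shuffled_in)
assert not torch.allclose(z[:, perm], z_shuffled, atol=1e-5)
```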

I don't understand the question about the learnable one. Indeed, you just sum in a signal that is nothing but learnable floats.
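Concretely, in the module dump above the learnable part is an nn.Embedding table indexed by position ids, which is the same thing as a trainable parameter matrix: gradients flow into it like any other weight. A quick sketch of that mechanism (the 768/512 dimensions come from the dump; the rest is illustrative):

```python
import torch
import torch.nn as nn

d_model, max_len = 768, 512
pos_table = nn.Embedding(max_len, d_model)  # analogous to position_embeddings above

x = torch.randn(2, 10, d_model)                # token embeddings, seq_len = 10
position_ids = torch.arange(10).unsqueeze(0)   # (1, seq_len), broadcasts over batch
x_with_pos = x + pos_table(position_ids)

# The table rows are ordinary learnable floats: any loss that touches the
# output sends gradients into them during training.
x_with_pos.sum().backward()
assert pos_table.weight.grad is not None
assert pos_table.weight.grad[:10].abs().sum() > 0  # the used rows got gradient
```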