I’ve implemented a transformer model, following along with Peter Bloem’s blog.
I find myself confused by the high-level meaning of the position embeddings. The papers and articles I’ve read that describe position embeddings all seem to indicate that we embed the positions within individual sentences, which makes sense.
But if you look at the code accompanying Peter Bloem’s blog, it seems the position embeddings are for the entire sequence (i.e., potentially many sentences). The position embedding layer is defined as nn.Embedding(embedding_dim=a, num_embeddings=b), where a equals the dimension of the word embedding vectors and b is set to the length of the longest sequence (I believe 512).
Does this mean we are creating position vectors for 512 different positions? If so, I feel like that doesn’t make sense. The first word of a sentence could be at position 1 in one case and at position 242 in another, while in yet another case position 242 could be the last word of a sentence (or any word).
I used the same style of position embedding as Bloem did; that is, my position embedding layer is nn.Embedding(embedding_dim=word_embedding_size, num_embeddings=len_longest_sequence). I am getting good results, yet I feel quite confused about the position embeddings.
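For reference, here is a minimal sketch of how a learned position embedding of this kind is typically wired up in PyTorch. nn.Embedding takes the number of rows first (num_embeddings) and the vector size second (embedding_dim), so the position table has one row per position. The names vocab_size, emb_dim, max_len, and embed are assumptions made for illustration and are not taken from Bloem’s code.

```python
import torch
import torch.nn as nn

vocab_size = 1000  # hypothetical vocabulary size
emb_dim = 128      # word embedding size (assumed value)
max_len = 512      # length of the longest sequence the model accepts

# One row per token id, and one row per position index.
token_embedding = nn.Embedding(num_embeddings=vocab_size, embedding_dim=emb_dim)
pos_embedding = nn.Embedding(num_embeddings=max_len, embedding_dim=emb_dim)


def embed(token_ids: torch.Tensor) -> torch.Tensor:
    """Map (batch, seq_len) token ids to (batch, seq_len, emb_dim) inputs."""
    _, seq_len = token_ids.shape
    # Positions are counted over the whole sequence: 0 .. seq_len - 1.
    positions = torch.arange(seq_len, device=token_ids.device)
    tok = token_embedding(token_ids)   # (batch, seq_len, emb_dim)
    pos = pos_embedding(positions)     # (seq_len, emb_dim)
    return tok + pos.unsqueeze(0)      # broadcast positions over the batch


x = torch.randint(0, vocab_size, (2, 8))  # two random sequences of length 8
out = embed(x)
print(out.shape, pos_embedding.weight.shape)
```

With these sizes the position table holds 512 vectors of dimension 128, one for each absolute position in the sequence.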
I think your understanding is correct. Self-attention on its own does not encode the order of its inputs, so we need positional encodings to give the model that notion of order. An input sequence of length 512 can consist of multiple sentences concatenated and fed in as a single sequence.
For example, take an input sequence of length 8, a1 a2 a3 a4. b1 b2 b3 b4, which consists of two sentences: a1 a2 a3 a4 and b1 b2 b3 b4. The corresponding positions would be 1 2 3 4 5 6 7 8. This is still valid because the order within each sentence is preserved: a1 comes before a2, and b1 comes before b2. It also implies that a1 comes before b1, and that’s fine.
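The numbering described above can be sketched in a few lines of plain Python. The helper name assign_positions is hypothetical; it simply flattens the sentences and counts tokens from 1, ignoring sentence boundaries.

```python
def assign_positions(sentences):
    """Flatten sentences into one sequence and number the tokens 1..n.

    Sentence boundaries are ignored: positions are absolute within the
    whole input sequence.
    """
    tokens = [tok for sentence in sentences for tok in sentence]
    return list(enumerate(tokens, start=1))


sample = [["a1", "a2", "a3", "a4"], ["b1", "b2", "b3", "b4"]]
numbered = assign_positions(sample)
for pos, tok in numbered:
    print(pos, tok)
```

Here b1 receives position 5 even though it is the first word of its sentence, because the count runs over the full sequence.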
Thank you for your reply! I’m glad I am on the right track. However, I still feel confused about the concept of position embedding for an entire sequence of sentences.
Suppose another input sequence is also of length 8: c1 c2 c3 c4 c5. d1 d2 d3. Position 5 corresponds to the last word of a sentence in my sample and to the first word of a sentence in your sample. Position 5 could be the middle of a sentence in a third sample. What exactly is position 5 learning then?
The embedding vectors for “cat” vs. “dog” make sense: they are different words that can be mapped to nearby points in a vector space (e.g., animal, four legs, etc.). But the meaning of position 5 seems random in this example. It can be early, middle, or late in a sentence. The only deterministic thing about it seems to be that it comes after position 4 and before position 6. But positions 4 and 6 can mean different things depending on the sample (similarly, positions 254 and 256 can mean different things).
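One way to look at what position 5 holds, as a sketch under assumptions (a toy vocabulary, untrained random embeddings, and hypothetical names tok_emb, pos_emb, and embed): the position table contributes the same vector at that index in every sample, and only the token embedding added to it differs. Whether the slot is a sentence start or a sentence end is not stored in the position vector itself.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
emb_dim, max_len = 16, 8

words = "a1 a2 a3 a4 b1 b2 b3 b4 c1 c2 c3 c4 c5 d1 d2 d3".split()
vocab = {w: i for i, w in enumerate(words)}  # toy vocabulary

tok_emb = nn.Embedding(num_embeddings=len(vocab), embedding_dim=emb_dim)
pos_emb = nn.Embedding(num_embeddings=max_len, embedding_dim=emb_dim)


def embed(sample):
    """Return token + position embeddings, shape (len(sample), emb_dim)."""
    ids = torch.tensor([vocab[w] for w in sample])
    positions = torch.arange(len(sample))  # 0-based absolute positions
    return tok_emb(ids) + pos_emb(positions)


sample_1 = "a1 a2 a3 a4 b1 b2 b3 b4".split()  # position 5 starts a sentence
sample_2 = "c1 c2 c3 c4 c5 d1 d2 d3".split()  # position 5 ends a sentence
x1, x2 = embed(sample_1), embed(sample_2)

idx = 4  # position 5 when counting from 1
pos5 = pos_emb.weight[idx]

# Remove the token part: what remains is the position contribution.
contrib_1 = x1[idx] - tok_emb.weight[vocab["b1"]]
contrib_2 = x2[idx] - tok_emb.weight[vocab["c5"]]

print(torch.allclose(contrib_1, pos5, atol=1e-5))  # same position vector
print(torch.allclose(contrib_2, pos5, atol=1e-5))  # same position vector
print(torch.allclose(x1[idx], x2[idx]))            # summed inputs compared
```

In this setup the position vector only says "fifth slot in the sequence"; any information about the role of that slot within a sentence has to come from the tokens around it.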
I plan to spend a lot of time studying this and related issues to come to a deeper understanding. But in the meantime I am hoping to gain some high-level intuition for what the position embeddings mean. I found a paper that seems to get at the question, and I hope to find some insights in it.