The shape of tensor k is (seq_len, batch_size, embed_dim).
In activation.py:
bias_k = Parameter(torch.empty((1, 1, embed_dim)))
In functional.py:
# add bias along batch dimension (currently second)
k = torch.cat([k, bias_k.repeat(1, bsz, 1)])
However, torch.cat is called without a dim argument, so it defaults to dim=0 and concatenates the two tensors along the seq_len dimension, not the batch dimension.
Is the annotation # add bias along batch dimension (currently second) wrong?
I think it should say "add bias along the sequence length dimension", since the repeat broadcasts bias_k across the batch dimension but the concatenation itself happens along the sequence dimension.
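A minimal shape check (with arbitrary small dimensions chosen here for illustration) confirms the behavior: the repeat expands along the batch dimension (dim 1), while the cat grows the sequence dimension (dim 0, the default).

```python
import torch

seq_len, bsz, embed_dim = 5, 2, 4
k = torch.randn(seq_len, bsz, embed_dim)
bias_k = torch.empty(1, 1, embed_dim).normal_()

# repeat broadcasts bias_k across the batch dimension (dim 1)
repeated = bias_k.repeat(1, bsz, 1)   # shape: (1, bsz, embed_dim)

# cat with no dim argument concatenates along dim 0, the seq_len dimension
k2 = torch.cat([k, repeated])         # shape: (seq_len + 1, bsz, embed_dim)

print(repeated.shape)  # torch.Size([1, 2, 4])
print(k2.shape)        # torch.Size([6, 2, 4])
```

So the result has seq_len + 1 keys per batch element: bias_k is appended as one extra key at the end of the sequence, identical across the batch, which matches what the code does even if the comment's wording is confusing.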