The shape of tensor k is (seq_len, batch_size, embed_dim).
In activation.py:
bias_k = Parameter(torch.empty((1, 1, embed_dim)))
In functional.py:
# add bias along batch dimension (currently second)
k = torch.cat([k, bias_k.repeat(1, bsz, 1)])
However, torch.cat is called without a dim argument, so it defaults to dim=0 and concatenates the two tensors along the seq_len dimension, not the batch dimension.
Is the annotation # add bias along batch dimension (currently second) wrong?
I think it should say "add bias along the sequence length dimension", since the repeat broadcasts bias_k across the batch dimension but the concatenation itself happens along the sequence dimension.
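A minimal shape check (with arbitrary small dimensions chosen here for illustration) confirms the behavior: the repeat expands along the batch dimension (dim 1), while the cat grows the sequence dimension (dim 0, the default).

```python
import torch

seq_len, bsz, embed_dim = 5, 2, 4
k = torch.randn(seq_len, bsz, embed_dim)
bias_k = torch.empty(1, 1, embed_dim).normal_()

# repeat broadcasts bias_k across the batch dimension (dim 1)
repeated = bias_k.repeat(1, bsz, 1)   # shape: (1, bsz, embed_dim)

# cat with no dim argument concatenates along dim 0, the seq_len dimension
k2 = torch.cat([k, repeated])         # shape: (seq_len + 1, bsz, embed_dim)

print(repeated.shape)  # torch.Size([1, 2, 4])
print(k2.shape)        # torch.Size([6, 2, 4])
```

So the result has seq_len + 1 keys per batch element: bias_k is appended as one extra key at the end of the sequence, identical across the batch, which matches what the code does even if the comment's wording is confusing.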