Custom implementation of attention

Hi,
I have implemented the following attention function, adapting it from another implementation
(https://github.com/littleflow3r/attention-bilstm-for-relation-classification/blob/master/model.py),
and this is the final result:

import torch
import torch.nn.functional as F

def attention(out, hidden):
    # (seq_len, batch, hidden) -> (batch, seq_len, hidden)
    out = out.permute(1, 0, 2)
    # drop a leading size-1 dim if present (no-op for an already 2-D hidden)
    hidden = hidden.squeeze(0)
    # dot-product score of each time step against the hidden state: (batch, seq_len)
    attn_weights = torch.einsum('pqr,pr->pq', [out, hidden])
    soft_attn_weights = F.softmax(attn_weights, 1)
    # attention-weighted sum over the time steps: (batch, hidden)
    new_hid = torch.einsum('pqr,pq->pr', [out, soft_attn_weights])
    return new_hid

This is intended to be used after an LSTM, and it works in my code.

Calling it with

out: torch.Size([2, 2, 3])
hidden: torch.Size([2, 3])

gives torch.Size([2, 3])
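
For completeness, this is a minimal sketch of the call that produces those shapes (the random values are just placeholders; only the shapes matter):

out = torch.randn(2, 2, 3)    # (seq_len=2, batch=2, hidden=3), as returned by the LSTM
hidden = torch.randn(2, 3)    # final hidden state, already squeezed to (batch, hidden)
context = attention(out, hidden)
context.shape                 # torch.Size([2, 3])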

But if I use the AllenNLP attention implementations, I get:

from allennlp.modules.attention import CosineAttention, DotProductAttention

linear = CosineAttention(normalize=False)
output = linear(
    torch.FloatTensor([[0, 0, 0], [1, 1, 1]]),
    torch.FloatTensor([[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]]),
)
output.shape
torch.Size([2, 2])

[2, 3], [2, 2, 3] ---> [2, 2]

linear = DotProductAttention(normalize=False)
output = linear(
    torch.FloatTensor([[0, 0, 0], [1, 1, 1]]),
    torch.FloatTensor([[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]]),
)
output.shape
torch.Size([2, 2])

[2, 3], [2, 2, 3] ---> [2, 2]
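
As far as I can tell from the docs, these AllenNLP modules return only the attention scores over the rows of the matrix, so to compare with my function I still have to apply the weights to the matrix myself. This is just a sketch of what I mean (assuming the [2, 2] output really is one score per row; the weighted sum uses the same einsum as my attention()):

vector = torch.FloatTensor([[0, 0, 0], [1, 1, 1]])                                # (batch=2, hidden=3)
matrix = torch.FloatTensor([[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]])   # (batch=2, rows=2, hidden=3)

scores = DotProductAttention(normalize=False)(vector, matrix)    # (2, 2) raw scores
weights = F.softmax(scores, dim=1)                               # normalize them myself
context = torch.einsum('pqr,pq->pr', [matrix, weights])          # (2, 3), same shape as my attention()
context.shape                                                    # torch.Size([2, 3])

With this extra step the AllenNLP result has the same shape as my function, so is the difference just where the softmax and the weighted sum happen?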

Can you provide some feedback? Is my implementation wrong,
or is it just meant for a different use case? I would also appreciate some theoretical feedback.