Hi,
I have implemented the following attention function, adapting it from another implementation
(https://github.com/littleflow3r/attention-bilstm-for-relation-classification/blob/master/model.py),
and the final result is:
import torch
import torch.nn.functional as F

def attention(out, hidden):
    # out: (seq_len, batch, hidden) -> (batch, seq_len, hidden)
    out = out.permute(1, 0, 2)
    # hidden: (1, batch, hidden) -> (batch, hidden)
    hidden = hidden.squeeze(0)
    # Score each timestep by its dot product with the final hidden state.
    attn_weights = torch.einsum('pqr,pr->pq', [out, hidden])
    soft_attn_weights = F.softmax(attn_weights, 1)
    # Weighted sum of the timesteps -> one context vector per batch element.
    new_hid = torch.einsum('pqr,pq->pr', [out, soft_attn_weights])
    return new_hid
This is intended to be used after an LSTM, and it works in my code.
Calling it with
out: torch.Size([2, 2, 3])
hidden: torch.Size([2, 3])
gives torch.Size([2, 3])
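To make the first contraction concrete, here is a pure-Python sketch (no torch, with made-up small tensors) of what 'pqr,pr->pq' computes: one dot product per (batch, timestep) pair, so (2, 2, 3) and (2, 3) contract to (2, 2).

```python
# Hypothetical stand-in for torch.einsum('pqr,pr->pq', [out, hidden]),
# using small example values, just to confirm the quoted shapes.
out = [[[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
       [[0.0, 0.0, 1.0], [1.0, 1.0, 1.0]]]   # (p=2, q=2, r=3)
hidden = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]  # (p=2, r=3)

# 'pqr,pr->pq': contract over r, i.e. dot each timestep with its
# batch's hidden vector -> shape (p=2, q=2).
attn = [[sum(o * h for o, h in zip(step, hid)) for step in seq]
        for seq, hid in zip(out, hidden)]
# attn == [[1.0, 2.0], [6.0, 15.0]]
```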
But if I use the AllenNLP attention implementations, I get:
linear = CosineAttention(normalize=False)
output = linear(
    torch.FloatTensor([[0, 0, 0], [1, 1, 1]]),
    torch.FloatTensor([[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]]),
)
output.shape
torch.Size([2, 2])
[2, 3], [2, 2, 3] ---> [2, 2]
linear = DotProductAttention(normalize=False)
output = linear(
    torch.FloatTensor([[0, 0, 0], [1, 1, 1]]),
    torch.FloatTensor([[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]]),
)
output.shape
torch.Size([2, 2])
[2, 3], [2, 2, 3] ---> [2, 2]
Can you provide some feedback? Is my implementation wrong, or is it simply meant for a different use case? I would appreciate some theoretical feedback.