Attention at decoder side in a Seq2seq model not working properly

My implementation of attention on the decoder side is:

import torch
import torch.nn as nn
import torch.nn.functional as F

class Attention(nn.Module):
    def __init__(self, enc_hid_dim, dec_hid_dim):
        # nn.Module must be initialised before submodules/parameters are assigned
        super().__init__()
        self.enc_hid_dim = enc_hid_dim
        self.dec_hid_dim = dec_hid_dim
        self.attn = nn.Linear((enc_hid_dim * 2) + dec_hid_dim, dec_hid_dim)
        self.v = nn.Parameter(torch.rand(dec_hid_dim))

    def forward(self, hidden, encoder_outputs):
        #hidden = [batch size, dec hid dim]
        #encoder_outputs = [src sent len, batch size, enc hid dim * 2]
        batch_size = encoder_outputs.shape[1]
        src_len = encoder_outputs.shape[0]
        #repeat the decoder hidden state once per source position
        hidden = hidden.unsqueeze(1).repeat(1, src_len, 1)
        encoder_outputs = encoder_outputs.permute(1, 0, 2)
        #hidden = [batch size, src sent len, dec hid dim]
        #encoder_outputs = [batch size, src sent len, enc hid dim * 2]
        #concatenate along the feature dimension to match the input size of self.attn
        energy = torch.tanh(self.attn(torch.cat((hidden, encoder_outputs), dim=2)))
        #energy = [batch size, src sent len, dec hid dim]
        energy = energy.permute(0, 2, 1)
        #energy = [batch size, dec hid dim, src sent len]
        #v = [dec hid dim]
        v = self.v.repeat(batch_size, 1).unsqueeze(1)
        #v = [batch size, 1, dec hid dim]
        attention = torch.bmm(v, energy).squeeze(1)
        #attention = [batch size, src sent len]
        return F.softmax(attention, dim=1)

Please let me know if the above implementation is wrong.
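One quick way to check an implementation like this is to feed it random tensors and confirm that the output has shape [batch size, src sent len] and that each row sums to 1. The sketch below is only an illustration: `additive_attention` is a hypothetical function that restates the same computation, the dimensions are made up, and it assumes the two inputs are concatenated before the linear layer.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def additive_attention(hidden, encoder_outputs, attn, v):
    # hidden = [batch size, dec hid dim]
    # encoder_outputs = [src sent len, batch size, enc hid dim * 2]
    src_len = encoder_outputs.shape[0]
    hidden = hidden.unsqueeze(1).repeat(1, src_len, 1)
    encoder_outputs = encoder_outputs.permute(1, 0, 2)
    # assumption: hidden and encoder_outputs are concatenated on the feature dim
    energy = torch.tanh(attn(torch.cat((hidden, encoder_outputs), dim=2)))
    # energy = [batch size, src sent len, dec hid dim], v = [dec hid dim]
    scores = energy @ v  # batched matrix-vector product -> [batch size, src sent len]
    return F.softmax(scores, dim=1)


torch.manual_seed(0)
enc_hid_dim, dec_hid_dim, batch_size, src_len = 4, 6, 3, 5  # arbitrary toy sizes
attn = nn.Linear(enc_hid_dim * 2 + dec_hid_dim, dec_hid_dim)
v = torch.rand(dec_hid_dim)

hidden = torch.randn(batch_size, dec_hid_dim)
encoder_outputs = torch.randn(src_len, batch_size, enc_hid_dim * 2)
weights = additive_attention(hidden, encoder_outputs, attn, v)

print(weights.shape)       # torch.Size([3, 5])
print(weights.sum(dim=1))  # every entry should be close to 1
```

If the shape or the row sums are off, the problem is most likely in a permute or in the softmax dimension.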

I recently found another implementation that uses tanh, softmax, and ReLU. Is that right?

I have seen many different ways of computing attention, so I think attention has no single unified form.

Then how do I know which form to use, and when? For example, using softmax instead of ReLU at the end would make a big difference, right?
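To make the difference concrete, here is a small sketch that applies both functions to the same made-up scores. It assumes ReLU is applied to the raw scores in place of softmax, which may not be how the other implementation uses it.

```python
import torch
import torch.nn.functional as F

# made-up attention scores for one decoder step over four source positions
scores = torch.tensor([[2.0, -1.0, 0.5, -3.0]])

# softmax: every weight is positive and each row sums to 1
softmax_weights = F.softmax(scores, dim=1)

# ReLU: negative scores become exactly 0, the rest are left unnormalised
relu_weights = F.relu(scores)

print(softmax_weights, softmax_weights.sum(dim=1))
print(relu_weights, relu_weights.sum(dim=1))
```

With softmax the result is a probability distribution over the source positions, so the context vector is a weighted average of the encoder outputs. With ReLU alone the weights are not normalised, so the scale of the context vector depends on the raw scores.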

Sorry, I cannot answer that because I have not run comparative experiments.
I generally use the attention mechanism from this paper: Effective Approaches to Attention-based Neural Machine Translation.
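For reference, a minimal sketch of the global attention with the "general" score from that paper might look like the following. The class name `LuongGeneralAttention` is hypothetical, and the sketch assumes batch-first tensors and equal encoder and decoder hidden sizes, so it is an illustration rather than a drop-in replacement for the code above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class LuongGeneralAttention(nn.Module):  # hypothetical name, for illustration only
    def __init__(self, hid_dim):
        super().__init__()
        # W_a for the "general" score: score(h_t, h_s) = h_t^T W_a h_s
        self.W_a = nn.Linear(hid_dim, hid_dim, bias=False)
        # W_c combines the context vector and the decoder state
        self.W_c = nn.Linear(hid_dim * 2, hid_dim, bias=False)

    def forward(self, dec_hidden, encoder_outputs):
        # dec_hidden = [batch size, hid dim]
        # encoder_outputs = [batch size, src len, hid dim]
        scores = torch.bmm(self.W_a(encoder_outputs), dec_hidden.unsqueeze(2)).squeeze(2)
        # scores = [batch size, src len]
        weights = F.softmax(scores, dim=1)
        # context = weighted average of the encoder outputs
        context = torch.bmm(weights.unsqueeze(1), encoder_outputs).squeeze(1)
        # context = [batch size, hid dim]
        attn_hidden = torch.tanh(self.W_c(torch.cat((context, dec_hidden), dim=1)))
        # attn_hidden = [batch size, hid dim]
        return attn_hidden, weights


torch.manual_seed(0)
layer = LuongGeneralAttention(hid_dim=8)
dec_hidden = torch.randn(2, 8)          # toy batch of 2 decoder states
encoder_outputs = torch.randn(2, 5, 8)  # 5 source positions per example
attn_hidden, weights = layer(dec_hidden, encoder_outputs)
print(attn_hidden.shape, weights.shape)  # torch.Size([2, 8]) torch.Size([2, 5])
```

Unlike the concatenation-based version in the question, the score here is a bilinear product of the decoder state and each encoder output, and softmax is still what turns the scores into weights.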