How do I implement Attention and its variants?


I have just started working with neural networks, and I am having trouble mapping the concepts given in papers to code.

I have a few questions:

  1. Do nn modules in PyTorch accept a batch dimension by default? I.e., if I prepare my input in batch x features x feature_length form, can I just ignore the batch dimension when defining my network class?
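To make question 1 concrete, here is a tiny sketch of what I mean by "ignoring the batch dimension" (all shapes are made up by me); the layer is defined without any reference to the batch size, and I just pass batched input through it:

```python
import torch
import torch.nn as nn

# A layer defined with no mention of the batch size:
# nn.Linear operates on the last dimension only.
net = nn.Linear(8, 4)

x = torch.randn(32, 10, 8)  # batch x seq x features
y = net(x)                  # batch dimension passes through untouched
print(y.shape)              # torch.Size([32, 10, 4])
```

Is this the intended usage, i.e. the leading dimensions are always treated as batch-like?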

  2. Isn’t a seq2seq model just a normal LSTM with a softmax decoder? I could put another layer before the softmax and it would still be an encoder, so what is the difference between seq2seq and simply adding more layers?
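For question 2, here is my current mental model of seq2seq as just "two LSTMs passing a hidden state" (a toy sketch; the class, names, and sizes are all mine, not from any paper). Am I missing something beyond this?

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    # Hypothetical toy encoder-decoder; vocab/hidden sizes are made up.
    def __init__(self, vocab=100, hidden=16):
        super().__init__()
        self.emb = nn.Embedding(vocab, hidden)
        self.encoder = nn.LSTM(hidden, hidden, batch_first=True)
        self.decoder = nn.LSTM(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab)

    def forward(self, src, tgt):
        # Encoder summarizes the source into its final (h, c) state...
        _, state = self.encoder(self.emb(src))
        # ...and the decoder is initialized from that state.
        dec_out, _ = self.decoder(self.emb(tgt), state)
        return self.out(dec_out)  # logits over vocab at each target step

model = Seq2Seq()
src = torch.randint(0, 100, (2, 7))  # batch of 2 source sequences
tgt = torch.randint(0, 100, (2, 5))  # batch of 2 target sequences
logits = model(src, tgt)
print(logits.shape)                  # torch.Size([2, 5, 100])
```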

  3. How do I implement attention? Is it a layer? I think in Keras they just take the dot product of the LSTM hidden states to get scores, which are then multiplied back into the inputs and added to the final hidden state. Is attention any different? I don’t think it has any tunable parameters, but in many places in PyTorch it is implemented as a layer, i.e. a class. Why and how is that, and what is the benefit?
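To pin down question 3, this is the parameter-free dot-product scoring I had in mind (a sketch with shapes I made up). Since it has no learnable weights, wrapping something like this in an nn.Module class seems unnecessary to me; is the class form only needed once learnable projections are added?

```python
import torch

def dot_attention(dec_hidden, enc_outputs):
    # dec_hidden: (batch, hidden) - current decoder state
    # enc_outputs: (batch, seq, hidden) - all encoder hidden states
    scores = torch.bmm(enc_outputs, dec_hidden.unsqueeze(2)).squeeze(2)  # (batch, seq)
    weights = torch.softmax(scores, dim=1)                               # sum to 1 over seq
    context = torch.bmm(weights.unsqueeze(1), enc_outputs).squeeze(1)    # (batch, hidden)
    return context, weights

enc = torch.randn(4, 9, 16)
dec = torch.randn(4, 16)
ctx, w = dot_attention(dec, enc)
print(ctx.shape, w.shape)  # torch.Size([4, 16]) torch.Size([4, 9])
```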

  4. How do I implement self-attention? Here I don’t even have a clue. Is it a linear layer? How do I do this?
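For question 4, is self-attention basically the following: the sequence attends to itself, and the linear layers are the only tunable parameters? This is my guess at a minimal single-head version (all names and sizes are mine); please correct me if the structure is wrong:

```python
import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    # My guess: three linear maps (query/key/value) + scaled dot product.
    def __init__(self, dim=16):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        self.scale = dim ** 0.5

    def forward(self, x):  # x: (batch, seq, dim)
        q, k, v = self.q(x), self.k(x), self.v(x)
        scores = q @ k.transpose(1, 2) / self.scale  # (batch, seq, seq)
        weights = torch.softmax(scores, dim=-1)      # each position attends to all others
        return weights @ v                           # (batch, seq, dim)

attn = SelfAttention(16)
x = torch.randn(2, 5, 16)
out = attn(x)
print(out.shape)  # torch.Size([2, 5, 16])
```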

This all appears very hard, so thanks for helping! Even some pointers would be great.