Is nn.MultiheadAttention in PyTorch just a linear transformation layer, while nn.TransformerEncoderLayer is the combination of nn.MultiheadAttention and a feed-forward layer?
In addition, here is an example of nn.TransformerEncoderLayer:
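(The original snippet was not preserved; a representative minimal example of nn.TransformerEncoderLayer usage, following the PyTorch docs, would look like this:)

```python
import torch
import torch.nn as nn

# One encoder layer: multi-head self-attention + feed-forward block,
# with model dimension 512 and 8 attention heads.
encoder_layer = nn.TransformerEncoderLayer(d_model=512, nhead=8)

# Input shape: (sequence length, batch size, embedding dimension)
src = torch.rand(10, 32, 512)
out = encoder_layer(src)
print(out.shape)  # torch.Size([10, 32, 512])
```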
After a brief look through the source code, I can see that MultiheadAttention calls multi_head_attention_forward, which contains a softmax call, as seen in the source here. So it does seem to have a non-linear function of some kind. @ptrblck (apologies for the tag) will be able to confirm any more specific details, but the source does show a non-linear function being used.
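To illustrate where the non-linearity sits, here is a simplified sketch of the core computation (not the actual PyTorch implementation, which also handles masking, dropout, and the per-head reshaping):

```python
import torch
import torch.nn.functional as F

def attention_core(q, k, v):
    # Scaled dot-product attention: the softmax over the attention
    # scores is the non-linear step.
    d_k = q.size(-1)
    scores = torch.matmul(q, k.transpose(-2, -1)) / d_k ** 0.5
    weights = F.softmax(scores, dim=-1)
    return torch.matmul(weights, v)

q = k = v = torch.rand(2, 5, 64)   # (batch, seq_len, head_dim)
out = attention_core(q, k, v)      # shape (2, 5, 64)
```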
Thanks again. I checked the source too and found that there is a linear layer at the end of MultiheadAttention that maps the derived attention output from embed_dim back to embed_dim:
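(The quoted source lines were not preserved; you can see this output projection directly by inspecting the module. The exact class name of the projection may vary across PyTorch versions, but it is a Linear layer mapping embed_dim to embed_dim:)

```python
import torch.nn as nn

mha = nn.MultiheadAttention(embed_dim=512, num_heads=8)
print(mha.out_proj)
# Prints a Linear-style layer, e.g.
# NonDynamicallyQuantizableLinear(in_features=512, out_features=512, bias=True)
```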
So I guess there is an extra linear transformation applied before PyTorch returns the attention output, beyond what I originally expected. @ptrblck @AlphaBetaGamma96 Could you have a look? Thanks!