Is there a simple example of implementing a residual connection with nn.MultiheadAttention?
I am asking because I wonder what happens to masked elements. For example, if an element is masked in a multi-head attention layer, it still comes back through the residual connection afterwards, so that element's information no longer seems to be blocked.
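To make the question concrete, here is a minimal sketch of the kind of block I have in mind. The class name `ResidualSelfAttention`, the tensor sizes, and the choice of `key_padding_mask` plus a post-attention `LayerNorm` are my own assumptions for illustration, not a reference implementation. It changes only the masked element's input and checks which output rows are affected.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)


class ResidualSelfAttention(nn.Module):
    """Hypothetical block: self-attention, then residual add, then LayerNorm."""

    def __init__(self, embed_dim, num_heads):
        super().__init__()
        # dropout defaults to 0.0, so the forward pass is deterministic
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, x, key_padding_mask=None):
        # Self-attention: query, key and value are all x
        attn_out, _ = self.attn(
            x, x, x, key_padding_mask=key_padding_mask, need_weights=False
        )
        # Residual connection: each position gets its own input added back
        return self.norm(x + attn_out)


block = ResidualSelfAttention(embed_dim=8, num_heads=2)

x = torch.randn(1, 4, 8)  # (batch, sequence, embedding)
mask = torch.tensor([[False, False, False, True]])  # True = ignore this key

# Same input, except the masked element (position 3) is replaced
x2 = x.clone()
x2[0, 3] = torch.randn(8)

with torch.no_grad():
    y1 = block(x, key_padding_mask=mask)
    y2 = block(x2, key_padding_mask=mask)

# Do the unmasked positions see the change? Does the masked position?
unmasked_same = torch.allclose(y1[0, :3], y2[0, :3], atol=1e-6)
masked_changed = not torch.allclose(y1[0, 3], y2[0, 3])
print("unmasked rows unchanged:", unmasked_same)
print("masked row changed:", masked_changed)
```

As far as I understand, `key_padding_mask` only stops other positions from attending to the masked one, while the residual adds each position's input to its own output row. So in this sketch the masked element's information should show up only in its own output row, which would then have to be ignored downstream (for example in the loss).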