But nn.MultiheadAttention is just for sequence models.
I guess you meant some techniques for applying attention to convolutional networks.
Attention is like a new wave for convnets.
You can do it by changing the architecture, by changing the loss function, or both.
The problem with convolution is that it has a local receptive field.
Fully connected layers, in contrast, have a global receptive field. Hence the idea of combining the two with SE blocks here; see the sketch below.
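For reference, a minimal SE (Squeeze-and-Excitation) block sketch in PyTorch; the reduction factor and layer names are just placeholders, not a specific library's API:

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Minimal Squeeze-and-Excitation block: global average pooling gives every
    channel a global receptive field, two FC layers produce per-channel gates."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):                           # x: (B, C, H, W)
        w = x.mean(dim=(2, 3))                      # squeeze: (B, C)
        w = self.fc(w).unsqueeze(-1).unsqueeze(-1)  # excitation: (B, C, 1, 1)
        return x * w                                # reweight the conv feature map
```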
There is also the idea of concatenating convolutional and attentional feature maps here.
We now formally describe our proposed Attention Augmentation method. We use the following naming conventions: H, W and Fin refer to the height, width and number of input filters of an activation map. Nh, dv and dk respectively refer to the number of heads, the depth of values and the depth of queries and keys in multihead-attention (MHA). We further assume that Nh divides dv and dk evenly and denote dhv and dhk the depth of values and queries/keys per attention head.
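As a rough illustration of that concatenation idea with this naming, here is a hedged PyTorch sketch. It omits the relative position encodings from the paper, and the 1x1-conv projections, head reshapes and argument names are my own assumptions rather than the paper's exact implementation:

```python
import torch
import torch.nn as nn

class AugmentedConv2d(nn.Module):
    """Concatenate an ordinary conv feature map with a multi-head self-attention
    feature map computed over all H*W positions (sketch only, no relative
    position encodings)."""
    def __init__(self, f_in, f_out, kernel_size, d_k, d_v, n_h):
        super().__init__()
        assert d_k % n_h == 0 and d_v % n_h == 0          # Nh must divide dk and dv
        self.conv = nn.Conv2d(f_in, f_out - d_v, kernel_size, padding=kernel_size // 2)
        self.qkv = nn.Conv2d(f_in, 2 * d_k + d_v, 1)      # 1x1 conv producing Q, K, V
        self.proj = nn.Conv2d(d_v, d_v, 1)
        self.d_k, self.d_v, self.n_h = d_k, d_v, n_h

    def forward(self, x):                                 # x: (B, Fin, H, W)
        b, _, h, w = x.shape
        q, k, v = self.qkv(x).split([self.d_k, self.d_k, self.d_v], dim=1)

        def heads(t, d):                                  # (B, d, H, W) -> (B, Nh, H*W, d/Nh)
            return t.reshape(b, self.n_h, d // self.n_h, h * w).transpose(2, 3)

        q, k, v = heads(q, self.d_k), heads(k, self.d_k), heads(v, self.d_v)
        attn = torch.softmax(q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5), dim=-1)
        o = (attn @ v).transpose(2, 3).reshape(b, self.d_v, h, w)
        return torch.cat([self.conv(x), self.proj(o)], dim=1)  # (B, Fout, H, W)
```

For example, AugmentedConv2d(f_in=3, f_out=32, kernel_size=3, d_k=8, d_v=8, n_h=2) maps a (B, 3, H, W) input to a (B, 32, H, W) output, with 24 conv channels and 8 attention channels concatenated.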
If I do nn.MultiheadAttention(28, 2), then Nh = 2, but what are dv, dk, dhv and dhk?
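For what it's worth, with PyTorch's defaults the single embed_dim is shared by queries, keys and values and split evenly across heads, so a quick check looks like this (the mapping to the paper's names is my reading, not something PyTorch states):

```python
import torch.nn as nn

mha = nn.MultiheadAttention(embed_dim=28, num_heads=2)
print(mha.embed_dim)  # 28 -> total depth of queries, keys and values (dk = dv, assuming defaults)
print(mha.head_dim)   # 14 -> per-head depth (dhk = dhv = 28 // 2)
```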
If I want to transform an image to another image, then