Is there a way to get the attention scores of this layer? That is, F.softmax((Q @ K.t()) / math.sqrt(dim), -1), before it is multiplied with V.
If we do:
import torch
import torch.nn as nn
x = nn.TransformerEncoderLayer(10, 2)
y = nn.TransformerEncoder(x, 1)
src = torch.randn(1, 1, 10)
x.self_attn(src, src, src)
then we get:
(tensor([[[-0.1861, 0.1664, 0.0857, -0.2807, -0.2680, -0.1627, 0.0585,
0.1379, -0.0257, -0.0476]]], grad_fn=<AddBackward0>),
tensor([[[1.1111]]], grad_fn=<DivBackward0>))
The second output is the attention weights, averaged over the heads.
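For per-head weights rather than the average, a minimal sketch follows. It assumes a PyTorch version in which nn.MultiheadAttention.forward accepts average_attn_weights (1.11 or later, as far as I know). The sequence length of 4 and the names layer and weights are only illustrative. The 1.1111 in the output above is most likely 1 / (1 - 0.1): the layer is in training mode and the default dropout of 0.1 is applied to the weights, so the sketch calls eval() first.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

layer = nn.TransformerEncoderLayer(d_model=10, nhead=2)
layer.eval()  # turn off dropout so each row of weights sums to 1

# batch_first defaults to False, so the layout is (seq_len, batch, d_model).
src = torch.randn(4, 1, 10)

# Call the attention submodule directly and ask for unaveraged weights.
# average_attn_weights is assumed to be available (PyTorch >= 1.11).
attn_out, weights = layer.self_attn(
    src, src, src, need_weights=True, average_attn_weights=False
)
weights = weights.detach()

print(weights.shape)    # (batch, nhead, seq_len, seq_len)
print(weights.sum(-1))  # every row sums to 1 in eval mode
```

These are the softmax outputs before they are multiplied with V. As far as I can tell, the encoder layer's own forward does not return them, which is why self_attn is called directly here.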