I’m using the nn.MultiheadAttention layer (PyTorch v1.1.0) with num_heads=19 and an input tensor of shape [model_size, batch_size, embed_size].
Based on the original Attention Is All You Need paper, I understand there should be one matrix of attention weights per head (19 in my case), but I can’t find a way of accessing them. When doing a forward pass, the returned weights have shape [batch_size, model_size, model_size] instead of something like [batch_size, 19, model_size, model_size]. I’m guessing the returned weights are an average over all the heads, but that isn’t specified in the docs.
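For concreteness, here is a minimal sketch of what I’m seeing (the sizes are hypothetical; note embed_dim must be divisible by num_heads, so I use 19 * 8 = 152):

```python
import torch
import torch.nn as nn

embed_dim, num_heads = 152, 19   # 152 = 19 * 8, divisible by num_heads
seq_len, batch_size = 10, 4      # hypothetical sizes for illustration

mha = nn.MultiheadAttention(embed_dim, num_heads)

# Default layout is [seq_len, batch_size, embed_dim] (batch_first=False)
x = torch.rand(seq_len, batch_size, embed_dim)

attn_output, attn_weights = mha(x, x, x)
print(attn_weights.shape)  # torch.Size([4, 10, 10]) -- no per-head axis
```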
Is there another way of accessing the full attention weights?
For posterity: a flag to disable averaging of attention weights across heads was added in #70055. You can now pass average_attn_weights=False to get attention weights per head.
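A minimal sketch of the flag in use (assuming a PyTorch version recent enough to include #70055, i.e. >= 1.12; the sizes below are hypothetical):

```python
import torch
import torch.nn as nn

embed_dim, num_heads = 152, 19   # embed_dim must be divisible by num_heads
seq_len, batch_size = 10, 4

mha = nn.MultiheadAttention(embed_dim, num_heads)
x = torch.rand(seq_len, batch_size, embed_dim)  # [seq_len, batch_size, embed_dim]

# average_attn_weights=False returns one attention matrix per head:
# [batch_size, num_heads, seq_len, seq_len]
_, per_head_weights = mha(x, x, x, average_attn_weights=False)
print(per_head_weights.shape)  # torch.Size([4, 19, 10, 10])
```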