It seems like the two options for analyzing the attention weights are to
A. Take the average across the heads dim
B. Look at each individual head
but isn’t what’s happening inside the transformer that the attention weights are combined from 16 heads back to 1 via a linear layer? If so, to actually interpret how the model uses the attention weights, wouldn’t we want to look at how the 16 heads are combined, by returning the output of that linear layer? And why does PyTorch opt to return the plain average across the head dimension instead of weighting the heads the same way the model does? I’m sure I’m misunderstanding something here, so I would appreciate it if someone could point out where I’m going wrong. Thank you.
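For concreteness, here’s a minimal sketch of the two options I mean, using `nn.MultiheadAttention`’s `average_attn_weights` flag (this flag exists in PyTorch >= 1.11; the dimensions here are made up for illustration):

```python
import torch
import torch.nn as nn

embed_dim, num_heads, seq_len, batch = 64, 16, 10, 2
mha = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
x = torch.randn(batch, seq_len, embed_dim)

# Option A: weights averaged over the 16 heads -> (batch, seq_len, seq_len)
_, avg_w = mha(x, x, x, need_weights=True, average_attn_weights=True)

# Option B: per-head weights -> (batch, num_heads, seq_len, seq_len)
_, head_w = mha(x, x, x, need_weights=True, average_attn_weights=False)

print(avg_w.shape)   # torch.Size([2, 10, 10])
print(head_w.shape)  # torch.Size([2, 16, 10, 10])

# The returned average is just the unweighted mean over the head dimension,
# not anything derived from the output projection:
print(torch.allclose(avg_w, head_w.mean(dim=1), atol=1e-6))  # True
```

Neither option involves the output projection (`out_proj`), which is what prompts my question.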