Hello everyone, I’m trying to implement the Transformer for the first time using nn.MultiheadAttention, and I’m quite confused. Could someone help me with these two questions? (I’m referring to the documentation of nn.MultiheadAttention.)
Why isn’t the batch size the first dimension of query? (The expected shape of query for nn.MultiheadAttention is [target length, batch size, embed dim].) I would have expected [batch size, target length, embed dim] instead. torch.utils.data.DataLoader puts the batch size in the first dimension, and most other neural network modules do the same. What is the reason behind this convention? I’d like to know in case I’m missing something. For now I’m permuting the batch before passing it in, as in the sketch below.
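Here is a minimal sketch of what I’m doing now, assuming the sequence-first layout from the docs (embed_dim, num_heads, and the tensor sizes are just placeholders I picked):

```python
import torch
import torch.nn as nn

embed_dim, num_heads = 512, 8
mha = nn.MultiheadAttention(embed_dim, num_heads)  # expects [seq, batch, embed] by default

# A DataLoader typically yields batch-first tensors: [batch size, target length, embed dim]
batch = torch.randn(32, 10, embed_dim)

# Permute to the shape nn.MultiheadAttention expects: [target length, batch size, embed dim]
query = batch.permute(1, 0, 2)

# Self-attention: use the same tensor as query, key, and value
attn_output, attn_output_weights = mha(query, query, query)
print(attn_output.shape)  # torch.Size([10, 32, 512])
```

Is this permute the intended way to bridge the two conventions, or is there something cleaner?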
nn.MultiheadAttention returns attn_output and attn_output_weights. Why does it return attn_output_weights? Wouldn’t it be sufficient to use attn_output alone? Do I need attn_output_weights somewhere in the Transformer?
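For reference, this is how the two return values look in my test; right now I just discard the weights, and I’d like to confirm that’s fine (the shapes below are placeholders I chose):

```python
import torch
import torch.nn as nn

embed_dim, num_heads = 512, 8
mha = nn.MultiheadAttention(embed_dim, num_heads)

query = torch.randn(10, 32, embed_dim)        # [target length, batch size, embed dim]
key = value = torch.randn(12, 32, embed_dim)  # [source length, batch size, embed dim]

attn_output, attn_output_weights = mha(query, key, value)
print(attn_output.shape)          # torch.Size([10, 32, 512])
print(attn_output_weights.shape)  # torch.Size([32, 10, 12]), averaged over heads

# If the weights are not needed, simply ignore the second return value:
attn_output, _ = mha(query, key, value)
```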