Question on nn.MultiheadAttention

Hello everyone, I’m trying to code the Transformer for the first time using nn.MultiheadAttention. However, I’m very confused. Could someone help me with these two questions? For reference, see the documentation of nn.MultiheadAttention.

  1. Why is the batch size of query not the first dimension? (The size of query for nn.MultiheadAttention is [target length, batch size, embed dim].) I would expect [batch size, target length, embed dim] instead. I thought the convention was to always put the batch size in the first dimension, and most other neural network modules do. What is the reason behind this? I would like to know in case I’m missing something.

  2. nn.MultiheadAttention returns attn_output and attn_output_weights. Why does it return attn_output_weights? Wouldn’t it be sufficient to just use attn_output? Do I need attn_output_weights anywhere in the Transformer?
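To make the two questions concrete, here is a minimal sketch of calling nn.MultiheadAttention with its default (S, N, E) layout; the dimension sizes are arbitrary example values. Note that attn_output_weights is averaged over the heads by default:

```python
import torch
import torch.nn as nn

embed_dim, num_heads = 16, 4
mha = nn.MultiheadAttention(embed_dim, num_heads)  # batch_first=False by default

tgt_len, src_len, batch = 5, 7, 2
# Default layout: (sequence length, batch size, embed dim)
query = torch.randn(tgt_len, batch, embed_dim)
key = torch.randn(src_len, batch, embed_dim)
value = torch.randn(src_len, batch, embed_dim)

attn_output, attn_output_weights = mha(query, key, value)
print(attn_output.shape)          # torch.Size([5, 2, 16]) -- same layout as query
print(attn_output_weights.shape)  # torch.Size([2, 5, 7]) -- (batch, target len, source len)
```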


UP! I am also interested in this.

  1. I think this order is needed for the internal bmm operation, which does the main job. It is also consistent with the RNN modules’ default.
  2. As the need_weights argument suggests, the weights can be ignored.
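On point 2: passing need_weights=False tells the module to skip returning the attention weights; in that case the second element of the returned tuple is simply None. A minimal sketch:

```python
import torch
import torch.nn as nn

mha = nn.MultiheadAttention(embed_dim=8, num_heads=2)
q = torch.randn(3, 1, 8)  # (S, N, E) self-attention input

# With need_weights=False, no attention-weight tensor is returned
out, weights = mha(q, q, q, need_weights=False)
print(out.shape)  # torch.Size([3, 1, 8])
print(weights)    # None
```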

@ptrblck Hello. I am also very interested in this question. Setting the sequence dimension to the first dimension would make the matrix multiply extremely inefficient due to memory access issues.
Is this really because of compatibility with RNNs?
If designed completely from scratch, the most reasonable order would appear to be (N, E, S) instead of the current (S, N, E). This would make the sequence-length axes, which are the ones being multiplied in the (self-attention matrix @ value matrix) operation, contiguous in memory. As this is the most computationally intensive operation, it should be the most heavily optimized.
cuDNN and cuBLAS optimizations mean that (N, S, E) order, which is called batch_first order in torchtext, will probably have similar performance, though this should be tested.
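For what it’s worth, newer PyTorch versions (1.9+) expose a batch_first flag directly on nn.MultiheadAttention, so the (N, S, E) layout can be used without manual transposes; this thread may predate that option. A minimal sketch:

```python
import torch
import torch.nn as nn

# batch_first=True accepts and returns (batch, sequence, embed dim) tensors
mha = nn.MultiheadAttention(embed_dim=8, num_heads=2, batch_first=True)
x = torch.randn(2, 5, 8)  # (N, S, E)

out, _ = mha(x, x, x)
print(out.shape)  # torch.Size([2, 5, 8])
```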