Apply `MultiheadAttention`'s heads to the same input

Hi,
My question surely has a simple answer, but I couldn’t find it. I want to apply `MultiheadAttention`'s heads to the same sequence without copying it. My data is temporal, with dimensions (batch, time, channels); I treat the “channels” dimension as the embedding and the time dimension as the sequence dimension. For example:

```python
import torch

N, C, T = 2, 3, 5
n_heads = 7
X = torch.rand(N, T, C)  # (batch, time, channels)
```

Now, I want to apply 7 different heads as self-attention to the same input X, but as far as I understand, this requires me to copy the data 7 times:

```python
# embed_dim = C * n_heads, so each head ends up with head_dim = C
attn = torch.nn.MultiheadAttention(C * n_heads, n_heads, batch_first=True)
X_ = X.repeat(1, 1, n_heads)  # materializes 7 copies of X along the channel dim
attn(X_, X_, X_)
```
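
For reference, here is a hand-rolled sketch of the computation I have in mind (assuming PyTorch ≥ 2.0 for `torch.nn.functional.scaled_dot_product_attention`; the projection names are just my own placeholders). The per-head projections map the C channels of X directly to each head, so X itself is never repeated:

```python
import torch
import torch.nn.functional as F

N, C, T = 2, 3, 5
n_heads = 7
X = torch.rand(N, T, C)

# Per-head projections: each head sees the full C channels of X,
# and X is never repeated in memory.
q_proj = torch.nn.Linear(C, C * n_heads, bias=False)
k_proj = torch.nn.Linear(C, C * n_heads, bias=False)
v_proj = torch.nn.Linear(C, C * n_heads, bias=False)

def split_heads(x):
    # (N, T, n_heads * C) -> (N, n_heads, T, C)
    return x.view(N, T, n_heads, C).transpose(1, 2)

Q, K, V = split_heads(q_proj(X)), split_heads(k_proj(X)), split_heads(v_proj(X))
out = F.scaled_dot_product_attention(Q, K, V)  # (N, n_heads, T, C)
out = out.transpose(1, 2).reshape(N, T, n_heads * C)
```

If I understand the math correctly, this computes the same family of functions as the repeat-based version above (any in-projection of the tiled input reduces to some C → C * n_heads projection of the original X), so the copy only costs memory, not expressiveness. Still, I’d rather use the built-in module than maintain this by hand.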

Is there any way to do this without copying the data 7 times?
Thanks!