The PyTorch documentation does not really say how biases are implemented in
torch.nn.MultiheadAttention. It only refers to the original paper (Vaswani et al.), which does not use biases. A bit of common sense, plus some reverse engineering, makes it clear how the biases are inserted: instead of applying a linear transform to the query, key, and value inputs, the module applies an affine transform (i.e. linear + bias).
Now, writing down the formula shows that the “key” bias is redundant: it adds the same term to every cell of a row of the attention-score matrix (one row per query), and each row is then passed through a softmax, which is well known to be invariant under adding a constant to all of its inputs.
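Concretely, for a single head (a sketch: here q_i and k_j denote the linearly projected query and key rows, b_q and b_k the corresponding biases, and d the head dimension):

```latex
s_{ij} \;=\; \frac{(q_i + b_q)^\top (k_j + b_k)}{\sqrt{d}}
\;=\; \frac{q_i^\top k_j + b_q^\top k_j}{\sqrt{d}}
\;+\; \underbrace{\frac{(q_i + b_q)^\top b_k}{\sqrt{d}}}_{\text{independent of } j}
```

Since softmax_j(x_j + c) = softmax_j(x_j) for any constant c, the b_k term drops out of the attention weights entirely.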
To confirm, I ran a PyTorch MHA twice on the same input, changing only the “key” bias between the two runs, and observed no difference in the output. As a sanity check, changing the “query” or “value” bias instead did change the output.
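The same experiment can be reproduced without torch, using a minimal single-head attention written in numpy (a sketch, not the actual PyTorch implementation; all names, shapes, and the random weights are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 4, 3                            # embedding dim, sequence length
x = rng.normal(size=(n, d))            # same tensor used as query/key/value
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

def softmax(s):
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(bq, bk, bv):
    # Affine (linear + bias) projections, as in the reverse-engineered scheme.
    q, k, v = x @ Wq + bq, x @ Wk + bk, x @ Wv + bv
    return softmax(q @ k.T / np.sqrt(d)) @ v

zero = np.zeros(d)
b = rng.normal(size=d)
base = attention(zero, zero, zero)

# Changing only the key bias: identical output (softmax invariance).
assert np.allclose(base, attention(zero, b, zero))
# Changing the query or value bias instead: the output does change.
assert not np.allclose(base, attention(b, zero, zero))
assert not np.allclose(base, attention(zero, zero, b))
```

The key bias only shifts each score row by the constant q_i · b_k, which the row-wise softmax discards.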
Keeping the redundant key bias is therefore:
- a waste of memory (storing a useless parameter)
- a waste of computation, in the forward pass (adding the bias) and in the backward pass (propagating null gradients)
The cost is probably insignificant in most applications, but the redundant bias could easily be removed.