The PyTorch documentation does not really say how biases are implemented in
torch.nn.MultiheadAttention. It only refers to the original paper (Vaswani et al.), which does not use biases. A bit of common sense, plus some reverse engineering, makes it clear how the biases are inserted: instead of applying a linear transform to the query, key, and value inputs, the module applies an affine transform (i.e. linear + bias).
Now, writing down the formula shows that the “key” bias is redundant: it adds the same term to every cell of a row of the attention-score matrix (one row per query), and each row is then passed through a softmax, which is well known to be invariant under adding a constant to all of its inputs.
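Concretely, for a single head (a sketch: here q_i and k_j denote the linearly projected query and key rows, b_q and b_k the corresponding biases, and d the head dimension):

```latex
s_{ij} \;=\; \frac{(q_i + b_q)^\top (k_j + b_k)}{\sqrt{d}}
\;=\; \frac{q_i^\top k_j + b_q^\top k_j}{\sqrt{d}}
\;+\; \underbrace{\frac{(q_i + b_q)^\top b_k}{\sqrt{d}}}_{\text{independent of } j}
```

Since softmax_j(x_j + c) = softmax_j(x_j) for any constant c, the b_k term drops out of the attention weights entirely.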
To confirm, I ran a PyTorch MHA twice on the same input, changing only the “key” bias between the two runs, and observed no difference in the output. As a sanity check, changing the “query” or “value” bias instead did change the output.
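The same experiment can be reproduced without torch, using a minimal single-head attention written in numpy (a sketch, not the actual PyTorch implementation; all names, shapes, and the random weights are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 4, 3                            # embedding dim, sequence length
x = rng.normal(size=(n, d))            # same tensor used as query/key/value
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

def softmax(s):
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(bq, bk, bv):
    # Affine (linear + bias) projections, as in the reverse-engineered scheme.
    q, k, v = x @ Wq + bq, x @ Wk + bk, x @ Wv + bv
    return softmax(q @ k.T / np.sqrt(d)) @ v

zero = np.zeros(d)
b = rng.normal(size=d)
base = attention(zero, zero, zero)

# Changing only the key bias: identical output (softmax invariance).
assert np.allclose(base, attention(zero, b, zero))
# Changing the query or value bias instead: the output does change.
assert not np.allclose(base, attention(b, zero, zero))
assert not np.allclose(base, attention(zero, zero, b))
```

The key bias only shifts each score row by the constant q_i · b_k, which the row-wise softmax discards.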
Keeping the redundant key bias is therefore:
- a waste of memory (storing a useless parameter)
- a waste of computation, in the forward pass (adding the bias) and in the backward pass (propagating null gradients)
The cost is probably insignificant in most applications, but the redundant bias could easily be removed.