"Key" bias in PyTorch multi-head attention is redundant

The PyTorch documentation does not really say how biases are implemented in torch.nn.MultiheadAttention. It only refers to the original paper (Vaswani et al.), which does not use biases. A bit of common sense and reverse engineering shows how the biases are inserted: instead of applying a linear transform to the query, key, and value inputs, the module applies an affine transform (i.e. linear + bias).

Writing down the formula shows that the “key” bias is redundant: it adds the same term to every entry of a vector (more precisely, to every entry within each row of the attention score matrix), and that row is then passed to a softmax. Softmax is well known to be invariant under such a constant shift.
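The argument can be spelled out for a single head. The notation below is an assumption made for illustration, not taken from the PyTorch source: x_i is the i-th query input, y_j the j-th key input, W_Q, W_K, b_Q, b_K the per-head projection weights and biases, and d the head dimension.

```latex
\begin{align*}
s_{ij} &= \frac{(W_Q x_i + b_Q)^\top (W_K y_j + b_K)}{\sqrt{d}}
        = \underbrace{\frac{(W_Q x_i + b_Q)^\top W_K y_j}{\sqrt{d}}}_{a_{ij}}
        + \underbrace{\frac{(W_Q x_i + b_Q)^\top b_K}{\sqrt{d}}}_{c_i} \\
\operatorname{softmax}_j(s_{ij})
       &= \frac{e^{a_{ij} + c_i}}{\sum_{j'} e^{a_{ij'} + c_i}}
        = \frac{e^{a_{ij}}}{\sum_{j'} e^{a_{ij'}}}
        = \operatorname{softmax}_j(a_{ij})
\end{align*}
```

The term c_i depends on the query index i but not on the key index j, so it cancels in the softmax over j, and b_K never influences the attention weights.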

To confirm, I ran a PyTorch MultiheadAttention module twice with the same input, changing only the “key” bias from one run to the next, and observed no difference in the result. As a sanity check, changing the “query” or “value” bias instead did yield a difference.
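A minimal sketch of such an experiment follows. It is not necessarily the original snippet. It assumes that in_proj_bias stores the query, key, and value biases as three consecutive blocks of size embed_dim (in that order); the sizes and the helper name forward_with_shifted_bias are made up for illustration.

```python
import torch

torch.manual_seed(0)
embed_dim, num_heads = 16, 4  # illustrative sizes
mha = torch.nn.MultiheadAttention(embed_dim, num_heads)  # bias=True by default
x = torch.randn(5, 2, embed_dim)  # (seq_len, batch, embed_dim)


def forward_with_shifted_bias(block=None):
    """Run self-attention on x, optionally perturbing one block of in_proj_bias.

    block: 0 = query bias, 1 = key bias, 2 = value bias, None = unchanged.
    (Assumed layout of in_proj_bias: [query | key | value].)
    """
    saved = mha.in_proj_bias.detach().clone()
    with torch.no_grad():
        if block is not None:
            rows = slice(block * embed_dim, (block + 1) * embed_dim)
            mha.in_proj_bias[rows] += torch.randn(embed_dim)
        out, _ = mha(x, x, x, need_weights=False)
        mha.in_proj_bias.copy_(saved)  # restore the original bias
    return out


base = forward_with_shifted_bias()
out_q = forward_with_shifted_bias(0)
out_k = forward_with_shifted_bias(1)
out_v = forward_with_shifted_bias(2)

for name, out in [("query", out_q), ("key", out_k), ("value", out_v)]:
    print(f"{name:5s} bias changed: max |diff| = {(out - base).abs().max().item():.2e}")
```

With the key bias perturbed, the difference should stay at the level of floating-point rounding, whereas perturbing the query or value bias should change the output visibly.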

Snippet at


Keeping the redundant bias has two minor costs:

  • wasted memory to store the useless bias
  • wasted computation time in the forward pass (adding the bias) and in the backward pass (propagating null gradients)
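The "null gradients" point can be checked with a short sketch like the one below, under the same assumption about the layout of in_proj_bias ([query | key | value]). The sizes and the squared-sum loss are arbitrary choices for illustration.

```python
import torch

torch.manual_seed(0)
embed_dim, num_heads = 16, 4  # illustrative sizes
mha = torch.nn.MultiheadAttention(embed_dim, num_heads)
x = torch.randn(5, 2, embed_dim)  # (seq_len, batch, embed_dim)

out, _ = mha(x, x, x, need_weights=False)
out.pow(2).sum().backward()  # arbitrary scalar loss

# Split the packed bias gradient into its (assumed) query/key/value blocks.
grad_q, grad_k, grad_v = mha.in_proj_bias.grad.chunk(3)
for name, g in [("query", grad_q), ("key", grad_k), ("value", grad_v)]:
    print(f"{name:5s} bias: max |grad| = {g.abs().max().item():.2e}")
```

The key-bias gradient should be zero up to rounding error, while the query and value bias gradients are not, so the optimizer never meaningfully updates the key bias.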

It is probably not significant in most applications, but could easily be removed.