The default initialization of the Q, K and V matrices in the multi-head attention layer is inconsistent when self._qkv_same_embed_dim is True, i.e. when kdim == vdim == embed_dim. In that case the three projections are packed, for efficiency, into a single in_proj_weight of shape (3 * embed_dim, embed_dim), which is chunked into Q, K and V later on. The fan_out of 3 * embed_dim is therefore only an artifact of this packing, yet xavier_uniform_ depends on it, so the resulting bound differs from what you would get by initializing the three (embed_dim, embed_dim) matrices separately.
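A quick way to see the discrepancy is to compare the bound that xavier_uniform_ uses in the two cases. A minimal sketch (embed_dim = 512 is just an arbitrary example size, not the value used anywhere in particular):

```python
import math
import torch
import torch.nn as nn

embed_dim = 512  # arbitrary example size

# Packed projection, as used when _qkv_same_embed_dim is True:
# one (3*embed_dim, embed_dim) tensor, so xavier_uniform_ sees fan_out = 3*embed_dim.
in_proj_weight = torch.empty(3 * embed_dim, embed_dim)
nn.init.xavier_uniform_(in_proj_weight)

# Separate projection: (embed_dim, embed_dim), so fan_out = embed_dim.
q_proj_weight = torch.empty(embed_dim, embed_dim)
nn.init.xavier_uniform_(q_proj_weight)

# xavier_uniform_ samples from U(-a, a) with a = gain * sqrt(6 / (fan_in + fan_out)).
packed_bound = math.sqrt(6 / (embed_dim + 3 * embed_dim))   # = sqrt(3 / (2 * embed_dim))
separate_bound = math.sqrt(6 / (embed_dim + embed_dim))     # = sqrt(3 / embed_dim)

print(packed_bound, separate_bound, separate_bound / packed_bound)  # ratio is sqrt(2)
print(in_proj_weight.abs().max().item(), q_proj_weight.abs().max().item())
```

The packed tensor ends up with a bound that is smaller by a factor of sqrt(2) than the bound each of the three matrices would get on its own.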
A fix would be to pass gain = 2 on line 1231, so that the packed initialization gives the same result as applying xavier_uniform_ separately to self.q_proj_weight, self.k_proj_weight and self.v_proj_weight.
Interesting finding. I could not find a good reason why the gain is not compensated in this scenario. I think you meant that gain = sqrt(2) would solve the issue?
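For reference, a small sketch (again with an arbitrary embed_dim) checking that gain = sqrt(2) makes the packed bound equal the per-matrix bound:

```python
import math
import torch
import torch.nn as nn

embed_dim = 512  # arbitrary example size

# With gain = sqrt(2), the packed bound becomes
#   sqrt(2) * sqrt(6 / (embed_dim + 3*embed_dim)) = sqrt(3 / embed_dim),
# which equals the per-matrix bound sqrt(6 / (embed_dim + embed_dim)).
in_proj_weight = torch.empty(3 * embed_dim, embed_dim)
nn.init.xavier_uniform_(in_proj_weight, gain=math.sqrt(2))

per_matrix_bound = math.sqrt(6 / (2 * embed_dim))
print(in_proj_weight.abs().max().item(), per_matrix_bound)  # max should sit just below the bound
```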
Might be a good idea to create a GitHub issue on this.