The default initialization of the Q, K and V matrices in the multi-head attention layer is inconsistent when self._qkv_same_embed_dim is True, i.e. when kdim == vdim == embed_dim. In that case the three projections are packed, for efficiency, into a single in_proj_weight of shape (3 * embed_dim, embed_dim), which is chunked into Q, K and V later on. The fan_out of 3 * embed_dim is therefore only an artifact of this packing, yet xavier_uniform_ depends on it, so the resulting bound differs from what you would get by initializing the three (embed_dim, embed_dim) matrices separately.
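A quick way to see the discrepancy is to compare the bound that xavier_uniform_ uses in the two cases. A minimal sketch (embed_dim = 512 is just an arbitrary example size, not the value used anywhere in particular):

```python
import math
import torch
import torch.nn as nn

embed_dim = 512  # arbitrary example size

# Packed projection, as used when _qkv_same_embed_dim is True:
# one (3*embed_dim, embed_dim) tensor, so xavier_uniform_ sees fan_out = 3*embed_dim.
in_proj_weight = torch.empty(3 * embed_dim, embed_dim)
nn.init.xavier_uniform_(in_proj_weight)

# Separate projection: (embed_dim, embed_dim), so fan_out = embed_dim.
q_proj_weight = torch.empty(embed_dim, embed_dim)
nn.init.xavier_uniform_(q_proj_weight)

# xavier_uniform_ samples from U(-a, a) with a = gain * sqrt(6 / (fan_in + fan_out)).
packed_bound = math.sqrt(6 / (embed_dim + 3 * embed_dim))   # = sqrt(3 / (2 * embed_dim))
separate_bound = math.sqrt(6 / (embed_dim + embed_dim))     # = sqrt(3 / embed_dim)

print(packed_bound, separate_bound, separate_bound / packed_bound)  # ratio is sqrt(2)
print(in_proj_weight.abs().max().item(), q_proj_weight.abs().max().item())
```

The packed tensor ends up with a bound that is smaller by a factor of sqrt(2) than the bound each of the three matrices would get on its own.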
A fix would be to pass gain = 2 on line 1231, so that the packed initialization gives the same result as applying xavier_uniform_ separately to self.q_proj_weight, self.k_proj_weight and self.v_proj_weight.
Interesting finding. I could not find a good reason why the gain is not compensated in this scenario. I think you meant that gain = sqrt(2) would solve the issue?
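For reference, a small sketch (again with an arbitrary embed_dim) checking that gain = sqrt(2) makes the packed bound equal the per-matrix bound:

```python
import math
import torch
import torch.nn as nn

embed_dim = 512  # arbitrary example size

# With gain = sqrt(2), the packed bound becomes
#   sqrt(2) * sqrt(6 / (embed_dim + 3*embed_dim)) = sqrt(3 / embed_dim),
# which equals the per-matrix bound sqrt(6 / (embed_dim + embed_dim)).
in_proj_weight = torch.empty(3 * embed_dim, embed_dim)
nn.init.xavier_uniform_(in_proj_weight, gain=math.sqrt(2))

per_matrix_bound = math.sqrt(6 / (2 * embed_dim))
print(in_proj_weight.abs().max().item(), per_matrix_bound)  # max should sit just below the bound
```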
Might be a good idea to create a GitHub issue on this.