Multihead Attention in_proj is initialized inconsistently

The default initialization of the Q, K and V projection matrices in the multihead attention layer is inconsistent when `self._qkv_same_embed_dim` is True, i.e. when `kdim = vdim = embed_dim`. In this case the three projections are packed into a single `in_proj_weight` of shape `(3 * embed_dim, embed_dim)` purely for efficiency, and the packed tensor is chunked into the three matrices later on. `xavier_uniform_`, however, depends on the fan_out, which it reads off the packed shape as `3 * embed_dim` rather than `embed_dim`, so the packed weights get a smaller bound than three separately initialized matrices would.
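For illustration, here is a minimal sketch (the `embed_dim` value is arbitrary and not from the original code) comparing `xavier_uniform_` applied to the packed shape versus a single `(embed_dim, embed_dim)` projection; the resulting scales differ by roughly a factor of sqrt(2):

```python
import torch
import torch.nn as nn

embed_dim = 512

# Packed Q/K/V projection, as used when _qkv_same_embed_dim is True:
# fan_in = embed_dim, fan_out = 3 * embed_dim for the Xavier bound.
packed = torch.empty(3 * embed_dim, embed_dim)
nn.init.xavier_uniform_(packed)

# A single projection matrix initialized on its own:
# fan_in = fan_out = embed_dim.
separate = torch.empty(embed_dim, embed_dim)
nn.init.xavier_uniform_(separate)

print(packed.std().item())    # ~ sqrt(1 / (2 * embed_dim)), smaller
print(separate.std().item())  # ~ sqrt(1 / embed_dim)
print(separate.std().item() / packed.std().item())  # ~ sqrt(2) ≈ 1.414
```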

A fix would be to pass gain = 2 on line 1231 so that it gives the same result as applying `xavier_uniform_` separately to `self.q_proj_weight`, `self.k_proj_weight` and `self.v_proj_weight`.

Interesting finding. I could not find a good reason why there is no gain compensation in this scenario. I think you meant gain = sqrt(2) would solve the issue?
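A quick sketch of that correction (same assumed `embed_dim` as above, not taken from the PyTorch source): with gain = sqrt(2), the packed tensor's Xavier bound becomes sqrt(2) * sqrt(6 / (4 * embed_dim)) = sqrt(6 / (2 * embed_dim)), which matches the per-matrix bound.

```python
import math
import torch
import torch.nn as nn

embed_dim = 512

# Packed weight with gain = sqrt(2) compensates for fan_out = 3 * embed_dim.
packed = torch.empty(3 * embed_dim, embed_dim)
nn.init.xavier_uniform_(packed, gain=math.sqrt(2))

# Reference: a single (embed_dim, embed_dim) matrix with the default gain.
separate = torch.empty(embed_dim, embed_dim)
nn.init.xavier_uniform_(separate)

print(packed.std().item(), separate.std().item())  # now approximately equal
```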

Might be a good idea to create a GitHub issue on this.


Thanks for the suggestion and the correction; yes, it should be sqrt(2). I made it an issue (my first) in the GitHub repo.

Thanks. Just linking the GitHub issue here for tracking: Multihead Attention's in_proj_weight is initialized inconsistently · Issue #166378 · pytorch/pytorch · GitHub