In the current version of the MultiheadAttention module in torch.nn.modules.activation, the _reset_parameters method (at line 1116) initializes all of the projection weights with Xavier initialization except for the output projection weight. self.out_proj.weight is never touched there, so it presumably keeps the default initialization of a linear layer. This is the method:
def _reset_parameters(self):
    if self._qkv_same_embed_dim:
        xavier_uniform_(self.in_proj_weight)
    else:
        xavier_uniform_(self.q_proj_weight)
        xavier_uniform_(self.k_proj_weight)
        xavier_uniform_(self.v_proj_weight)

    if self.in_proj_bias is not None:
        constant_(self.in_proj_bias, 0.0)
        constant_(self.out_proj.bias, 0.0)
    if self.bias_k is not None:
        xavier_normal_(self.bias_k)
    if self.bias_v is not None:
        xavier_normal_(self.bias_v)
Since self.out_proj.weight is never set in this function, it ends up with the default linear-layer initialization, i.e. torch.nn.init.kaiming_uniform_(self.out_proj.weight, a=math.sqrt(5)) from the reset_parameters method of nn.Linear.
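For what it's worth, the fallback is easy to check empirically. Below is a minimal sketch (the embed_dim and num_heads values are just placeholders I picked): kaiming_uniform_ with a=sqrt(5) bounds a square weight by 1/sqrt(fan_in), while xavier_uniform_ would use sqrt(6 / (fan_in + fan_out)), so the observed range of out_proj.weight shows which path ran:

    import math
    import torch.nn as nn

    # Placeholder sizes, just for illustration.
    embed_dim, num_heads = 256, 8
    mha = nn.MultiheadAttention(embed_dim, num_heads)

    # Linear.reset_parameters uses kaiming_uniform_(a=math.sqrt(5)), which for the
    # square (embed_dim x embed_dim) out_proj weight samples from
    # U(-1/sqrt(embed_dim), 1/sqrt(embed_dim)).
    kaiming_bound = 1 / math.sqrt(embed_dim)        # ~0.0625 with these sizes
    # xavier_uniform_ would instead sample from U(-b, b) with
    # b = sqrt(6 / (fan_in + fan_out)) = sqrt(6 / (2 * embed_dim)).
    xavier_bound = math.sqrt(6 / (2 * embed_dim))   # ~0.108 with these sizes

    observed = mha.out_proj.weight.abs().max().item()
    print(observed, kaiming_bound, xavier_bound)
    # observed stays within the narrower kaiming bound, consistent with
    # out_proj.weight falling back to the nn.Linear default.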
I was wondering if this was an intentional choice or if it’s a bug.
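In case it's useful context, a minimal workaround (assuming the uniform Xavier treatment is what's intended for all of the projection weights) would be to re-initialize the output projection right after constructing the module:

    import torch.nn as nn
    from torch.nn.init import xavier_uniform_

    mha = nn.MultiheadAttention(embed_dim=256, num_heads=8)  # placeholder sizes
    xavier_uniform_(mha.out_proj.weight)  # give out_proj.weight the same Xavier treatment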