In the current version of the MultiheadAttention module in torch.nn.modules.activation, the _reset_parameters method (at line 1116) initializes all of the projection weights with Xavier initialization except for the output projection weight. self.out_proj.weight is never touched there, so it presumably keeps the default initialization of a linear layer. This is the method:
def _reset_parameters(self):
    if self._qkv_same_embed_dim:
        xavier_uniform_(self.in_proj_weight)
    else:
        xavier_uniform_(self.q_proj_weight)
        xavier_uniform_(self.k_proj_weight)
        xavier_uniform_(self.v_proj_weight)

    if self.in_proj_bias is not None:
        constant_(self.in_proj_bias, 0.0)
        constant_(self.out_proj.bias, 0.0)
    if self.bias_k is not None:
        xavier_normal_(self.bias_k)
    if self.bias_v is not None:
        xavier_normal_(self.bias_v)
Since self.out_proj.weight is never set in this function, it ends up with the default linear-layer initialization, i.e. torch.nn.init.kaiming_uniform_(self.out_proj.weight, a=math.sqrt(5)) from the reset_parameters method of nn.Linear.
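For what it's worth, the fallback is easy to check empirically. Below is a minimal sketch (the embed_dim and num_heads values are just placeholders I picked): kaiming_uniform_ with a=sqrt(5) bounds a square weight by 1/sqrt(fan_in), while xavier_uniform_ would use sqrt(6 / (fan_in + fan_out)), so the observed range of out_proj.weight shows which path ran:

    import math
    import torch.nn as nn

    # Placeholder sizes, just for illustration.
    embed_dim, num_heads = 256, 8
    mha = nn.MultiheadAttention(embed_dim, num_heads)

    # Linear.reset_parameters uses kaiming_uniform_(a=math.sqrt(5)), which for the
    # square (embed_dim x embed_dim) out_proj weight samples from
    # U(-1/sqrt(embed_dim), 1/sqrt(embed_dim)).
    kaiming_bound = 1 / math.sqrt(embed_dim)        # ~0.0625 with these sizes
    # xavier_uniform_ would instead sample from U(-b, b) with
    # b = sqrt(6 / (fan_in + fan_out)) = sqrt(6 / (2 * embed_dim)).
    xavier_bound = math.sqrt(6 / (2 * embed_dim))   # ~0.108 with these sizes

    observed = mha.out_proj.weight.abs().max().item()
    print(observed, kaiming_bound, xavier_bound)
    # observed stays within the narrower kaiming bound, consistent with
    # out_proj.weight falling back to the nn.Linear default.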
I was wondering if this was an intentional choice or if it’s a bug.
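In case it's useful context, a minimal workaround (assuming the uniform Xavier treatment is what's intended for all of the projection weights) would be to re-initialize the output projection right after constructing the module:

    import torch.nn as nn
    from torch.nn.init import xavier_uniform_

    mha = nn.MultiheadAttention(embed_dim=256, num_heads=8)  # placeholder sizes
    xavier_uniform_(mha.out_proj.weight)  # give out_proj.weight the same Xavier treatment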