What is gain value in nn.init.calculate_gain(nonlinearity, param=None) and how to use these gain number and why?

DonnyHe · September 30, 2021, 1:33pm

hey zihao - excuse my slow reply.

i used the following references to find out where the formula came from more or less https://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf, https://arxiv.org/pdf/1502.01852.pdf.

your estimation does not work for all cases. for example for relu, the post activation value y will be a noncentered variable, and then if you multiply by a weight matrix with \sigma_w, it’s variance will be var(w) (var(y) + mean(y)^2), which will need to be accounted for, and you have no way of knowing what E(y) will be, as its value depends on the layer. Hence your way of estimation will not work for every layer, this contradicts the fact that information gain is applied uniformly across all layers.

As a result, I recommend reading the papers (section 4.2 and 2.2 respectively) as they derive and explain how they come up with the scaling factor which is information gain here quite well