So I know nothing about the paper you cite. And I haven’t looked at the papers for He’s and Glorot’s initialization, but here is Thomas’ Theory of Weight Initialization for the Theory-Averse™:
- Why do we care about initialization? We want “stability” of activation distributions in deep networks. If no “signal” arrives in layer 100, we are screwed.
- How do we measure “signal”? Let’s just take standard deviation.
- What is then a good gain? One where the standard deviation “converges” to a reasonable positive value (or stays in some region).
- Let’s not theorize, let’s just try:
import torch

with torch.no_grad():
    a = torch.randn(1000, 1000)
    b = a
    for i in range(10):
        l = torch.nn.Linear(1000, 1000, bias=False)
        torch.nn.init.xavier_normal_(l.weight, torch.nn.init.calculate_gain('tanh'))
        b = l(b).tanh()
    print(f"in: {a.std().item():.4f}, out: {b.std().item():.4f}")
- You will get something like
in: 0.9998, out: 0.6518
- You will get more or less the same thing (in particular, 0.65-something) when you use 20 or 100 layers instead of just 10. Stability!
- You will also get the same (output, not input) if you multiply a by 0.5 before feeding it in.
- It doesn’t work quite as nicely when you use relu as the nonlinearity together with its gain.
- It will not work as well if you use 1 as the gain for tanh (a sketch of these variations is below).
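Here is a minimal sketch of those variations, just my own quick experiment (the 1000-wide layers and the depths are arbitrary choices, nothing from any paper):

import torch

def run(depth, nonlin, gain, scale=1.0):
    # same loop as above, parameterized over depth, nonlinearity, gain, and input scale
    with torch.no_grad():
        b = scale * torch.randn(1000, 1000)
        for _ in range(depth):
            l = torch.nn.Linear(1000, 1000, bias=False)
            torch.nn.init.xavier_normal_(l.weight, gain)
            b = nonlin(l(b))
        return b.std().item()

tanh_gain = torch.nn.init.calculate_gain('tanh')   # 5/3
relu_gain = torch.nn.init.calculate_gain('relu')   # sqrt(2)

print(run(100, torch.tanh, tanh_gain))             # ~0.65 again, even at depth 100
print(run(10, torch.tanh, tanh_gain, scale=0.5))   # ~0.65 despite the scaled input
print(run(10, torch.tanh, 1.0))                    # below 0.65, and it keeps shrinking with depth
print(run(10, torch.relu, relu_gain))              # the relu case is less tidy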
The gain of 1 for tanh sounds like it is motivated by tanh having derivative 1 at 0. If that is the derivation, you might run into trouble when the variance is not small.
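To see what I mean (again just my own quick check): tanh is close to the identity for small inputs, but it compresses inputs whose standard deviation is around 1, so a gain of 1 loses “signal” at every layer.

import torch

x_small = 0.01 * torch.randn(100_000)
x_unit = torch.randn(100_000)
print((x_small.tanh().std() / x_small.std()).item())  # ~1.0: here a gain of 1 would be fine
print((x_unit.tanh().std() / x_unit.std()).item())    # ~0.63: the std shrinks at every layer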
I’m not sure I’ve seen many deep networks with sigmoid activations.
I seem to remember watching A. Karpathy explain this in some CS231n lecture (with histograms of the activations).
Best regards
Thomas