So I know nothing about the paper you cite. And I haven’t looked at the papers for He’s and Glorot’s initialization, but here is Thomas’ Theory of Weight Initialization for the Theory-Averse™:
- Why do we care about initialization? We want “stability” of activation distributions in deep networks. If no “signal” arrives in layer 100, we are screwed.
- How do we measure “signal”? Let’s just take standard deviation.
- What is then a good gain? One where the standard deviation “converges” to a reasonable positive value (or stays in some region).
- Let’s not theorize, let’s just try:
import torch

with torch.no_grad():
    a = torch.randn(1000, 1000)
    b = a
    for i in range(10):
        l = torch.nn.Linear(1000, 1000, bias=False)
        torch.nn.init.xavier_normal_(l.weight, torch.nn.init.calculate_gain('tanh'))
        b = l(b).tanh()
    print(f"in: {a.std().item():.4f}, out: {b.std().item():.4f}")
- You will get something like
in: 0.9998, out: 0.6518
- You will get more or less the same thing (in particular, 0.65-something) when you use 20 or 100 layers instead of just 10. Stability!
- You will also get the same (output, not input) if you multiply a by 0.5 before feeding it in.
- It doesn’t work quite as nicely when you use relu as the nonlinearity together with its gain.
- It will not work as well if you use 1 as the gain for tanh (a sketch of these variations is below).
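Here is a minimal sketch of those variations, just my own quick experiment (the 1000-wide layers and the depths are arbitrary choices, nothing from any paper):

import torch

def run(depth, nonlin, gain, scale=1.0):
    # same loop as above, parameterized over depth, nonlinearity, gain, and input scale
    with torch.no_grad():
        b = scale * torch.randn(1000, 1000)
        for _ in range(depth):
            l = torch.nn.Linear(1000, 1000, bias=False)
            torch.nn.init.xavier_normal_(l.weight, gain)
            b = nonlin(l(b))
        return b.std().item()

tanh_gain = torch.nn.init.calculate_gain('tanh')   # 5/3
relu_gain = torch.nn.init.calculate_gain('relu')   # sqrt(2)

print(run(100, torch.tanh, tanh_gain))             # ~0.65 again, even at depth 100
print(run(10, torch.tanh, tanh_gain, scale=0.5))   # ~0.65 despite the scaled input
print(run(10, torch.tanh, 1.0))                    # below 0.65, and it keeps shrinking with depth
print(run(10, torch.relu, relu_gain))              # the relu case is less tidy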
The gain of 1 for tanh sounds like it is motivated by tanh having derivative 1 at 0. If that is the derivation, you might run into trouble when the variance is not small.
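To see what I mean (again just my own quick check): tanh is close to the identity for small inputs, but it compresses inputs whose standard deviation is around 1, so a gain of 1 loses “signal” at every layer.

import torch

x_small = 0.01 * torch.randn(100_000)
x_unit = torch.randn(100_000)
print((x_small.tanh().std() / x_small.std()).item())  # ~1.0: here a gain of 1 would be fine
print((x_unit.tanh().std() / x_unit.std()).item())    # ~0.63: the std shrinks at every layer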
I’m not sure I’ve seen many deep networks with sigmoid activations.
I seem to remember watching A. Karpathy explain this in some CS231n lecture (with histograms of the activations).
Best regards
Thomas