calculate_gain('tanh')

So I know nothing about the paper you cite. And I haven’t looked at the papers for He’s and Glorot’s initialization, but here is Thomas’ Theory of Weight Initialization for the Theory Averse™:

  • Why do we care about initialization? We want “stability” of activation distributions in deep networks. If no “signal” arrives in layer 100, we are screwed.
  • How do we measure “signal”? Let’s just take standard deviation.
  • What, then, is a good gain? One where the standard deviation “converges” to a reasonable positive value (or at least stays in some reasonable range).
  • Let’s not theorize, let’s just try:
import torch

with torch.no_grad():
    a = torch.randn(1000, 1000)
    b = a
    for i in range(10):
        # a fresh Xavier-initialized linear layer (no bias) per step, with the tanh gain
        l = torch.nn.Linear(1000, 1000, bias=False)
        torch.nn.init.xavier_normal_(l.weight, torch.nn.init.calculate_gain('tanh'))
        b = l(b).tanh()
    print(f"in: {a.std().item():.4f}, out: {b.std().item():.4f}")
  • You will get something like in: 0.9998, out: 0.6518.
  • You will get more or less the same thing (in particular 0.65x) when you take 20 or 100 layers instead of just 10. Stability!
  • You will also get the same output std (though not, of course, the same input std) if you multiply a by 0.5 before feeding it in.
  • It doesn’t work quite as nicely when you use relu as the nonlinearity with the relu gain.
  • It will not work as well if you use a gain of 1 for tanh (see the sketch after this list).
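
If you want to poke at the last two points yourself, here is a minimal variation of the loop above (my own sketch; the helper stack_std is just for illustration and not from any library): it reruns the experiment with relu plus the relu gain, and with tanh plus a gain of 1.

import torch

def stack_std(nonlin, gain, depth=10, width=1000):
    # run `depth` fresh Xavier-initialized linear layers (no bias), each followed
    # by `nonlin`, on a standard-normal input, and report the output std
    with torch.no_grad():
        b = torch.randn(width, width)
        for _ in range(depth):
            l = torch.nn.Linear(width, width, bias=False)
            torch.nn.init.xavier_normal_(l.weight, gain)
            b = nonlin(l(b))
        return b.std().item()

print("tanh, tanh gain:", stack_std(torch.tanh, torch.nn.init.calculate_gain('tanh')))
print("tanh, gain 1   :", stack_std(torch.tanh, 1.0))
print("relu, relu gain:", stack_std(torch.relu, torch.nn.init.calculate_gain('relu')))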

A gain of 1 for tanh sounds like it is motivated by tanh having derivative 1 at 0, i.e. tanh(x) ≈ x for small x. If that is the derivation, you might run into trouble with inputs of non-small variance.
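
To see what I mean, here is a quick one-layer check (my own sketch, not from any paper): for inputs with a small std, tanh is almost the identity, so a gain of 1 would keep the std; at unit std the output is already noticeably compressed.

import torch

# tanh is close to the identity for small inputs, so the "derivative 1 at 0"
# reasoning only holds when the pre-activations are small
with torch.no_grad():
    for s in (0.1, 1.0):
        x = s * torch.randn(1000, 1000)
        print(f"in std: {x.std().item():.4f}, tanh out std: {x.tanh().std().item():.4f}")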
I’m not sure I’ve seen many deep networks with sigmoid activations.

I seem to remember watching A. Karpathy explain this in a CS231n lecture (with histograms of the activations).

Best regards

Thomas
