I understand the results returned by calculate_gain for linear, relu, leaky_relu and sigmoid.

Can anyone tell me why calculate_gain('tanh') returns 5/3 ?
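For context, these are the values in question — this just prints the gains PyTorch's public API returns, nothing assumed beyond that:

```python
import torch

# Print the gain PyTorch recommends for each nonlinearity.
for fn in ("linear", "sigmoid", "tanh", "relu", "leaky_relu"):
    print(fn, torch.nn.init.calculate_gain(fn))
```

linear and sigmoid give 1, relu gives sqrt(2), and tanh is the odd one out at 5/3.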


The paper On weight initialization in deep neural networks provides mathematical justification for using gain 1 with Tanh activation, and gain 3.6 with Sigmoid activation.

I know nothing about the paper you cite, and I haven’t looked at the papers for He’s and Glorot’s initialization, but here is Thomas’ Theory of Weight Initialization for the Theory Adverse™:

  • Why do we care about initialization? We want “stability” of activation distributions in deep networks. If no “signal” arrives in layer 100, we are screwed.
  • How do we measure “signal”? Let’s just take standard deviation.
  • What is then a good gain? One where the standard deviation “converges” to a reasonable positive value (or stays in some region).
  • Let’s not theorize, let’s just try:
import torch

with torch.no_grad():
    a = torch.randn(1000, 1000)
    b = a
    for i in range(10):
        l = torch.nn.Linear(1000, 1000, bias=False)
        # Xavier/Glorot init scaled by PyTorch's recommended tanh gain (5/3)
        torch.nn.init.xavier_normal_(l.weight, torch.nn.init.calculate_gain('tanh'))
        b = l(b).tanh()
    print(f"in: {a.std().item():.4f}, out: {b.std().item():.4f}")
  • You will get something like in: 0.9998, out: 0.6518.
  • You will get more or less the same thing (in particular 0.65x) when you take 20 or 100 layers instead of just 10. Stability!
  • You will also get the same (output, not input) if you multiply a by 0.5 before feeding it in.
  • It doesn’t work quite as nicely when you use relu as the nonlinearity with the matching relu gain.
  • It will not work as well if you use 1 as gain for tanh.

The gain of 1 for tanh sounds like it is motivated by tanh having derivative 1 at 0. If that is the derivation, you might run into trouble with non-small variance, where tanh no longer acts like the identity.
I’m not sure I’ve seen many deep networks with sigmoid activations.
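A quick numerical check of the tanh point above: tanh is close to the identity only near 0 (slope 1 there), so a gain of 1 preserves the std only for small inputs; for larger input std, tanh compresses the distribution:

```python
import torch

torch.manual_seed(0)
# tanh roughly preserves std only for small inputs; larger inputs saturate.
for sigma in (0.1, 0.5, 1.0, 2.0):
    x = sigma * torch.randn(100_000)
    print(f"in std {sigma:.1f} -> out std {x.tanh().std().item():.3f}")
```

With std 0.1 the output std is still ~0.1; with std 2.0 it comes out well below 1, which is the shrinkage the extra 5/3 gain has to fight.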

I seem to remember watching A. Karpathy explain this in some CS231n lecture (with histograms of the activations).

Best regards



Thank you for the data-driven perspective. I hadn’t thought of that approach.

I modified your snippet to show the mean absolute value of the gradient too.

import torch
import torch.nn.functional as F
import sys

# Usage: first argument is the activation name (e.g. tanh), second is the gain.
a = torch.randn(1000, 1000, requires_grad=True)
b = a
print(f"in: {a.std().item():.4f}")
for i in range(100):
    l = torch.nn.Linear(1000, 1000, bias=False)
    torch.nn.init.xavier_normal_(l.weight, float(sys.argv[2]))
    b = getattr(F, sys.argv[1])(l(b))
    if i % 10 == 0:
        print(f"out: {b.std().item():.4f}", end=" ")
        a.grad = None
        b.sum().backward(retain_graph=True)  # backprop to measure gradient magnitude
        print(f"grad: {a.grad.abs().mean().item():.4f}")

A few results:

  • sigmoid seems stable with any gain > 0, but the gradients vanish pretty fast.
    The more layers you have, the higher the gain you will need.

  • tanh seems stable with pretty much any gain > 1.
    With gain 5/3 the output stabilises at ~0.65, but the gradients start to explode after around 10 layers.
    Gain 1.1 works much better, giving an output std stable around 0.30 and gradients that are much more stable, though they do grow slowly.

  • softsign with gain 1 has slowly vanishing output and gradients.
    Gain > 1 reduces the vanishing, but higher values eventually cause the gradients to explode.
    The higher the gain, the faster the gradients explode as you add layers.

  • relu seems to be inherently less stable than the others, but it works OK with gain ~= sqrt(2).

  • selu only works with gain 1 and gives output std ~= 1.00, but the grad slowly explodes after 10-20 layers.
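To make these sweeps easier to reproduce without command-line arguments, the loop above can be wrapped in a function — a sketch; `depth_signal` is a name I made up, and I use a smaller width for speed:

```python
import torch
import torch.nn.functional as F

def depth_signal(act_name, gain, depth=20, width=256):
    """Return (output std, mean |grad| at the input) after `depth` layers."""
    torch.manual_seed(0)
    a = torch.randn(width, width, requires_grad=True)
    b = a
    for _ in range(depth):
        l = torch.nn.Linear(width, width, bias=False)
        torch.nn.init.xavier_normal_(l.weight, gain)
        b = getattr(F, act_name)(l(b))
    b.sum().backward()
    return b.std().item(), a.grad.abs().mean().item()

out_std, grad_mag = depth_signal("tanh", 5 / 3)
print(f"tanh, gain 5/3: out std {out_std:.3f}, grad {grad_mag:.3f}")
```

Because xavier_normal_ scales by the fan-in/fan-out anyway, the smaller width should not change the qualitative picture.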


So for these two points: that might work, then. My impression was that the “usual” way to counter exploding gradients is clipping. (More prominently in RNNs, where tanh is still very common, too.)
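Since clipping came up, a minimal sketch of the standard PyTorch utility — the toy model and the threshold are arbitrary:

```python
import torch

model = torch.nn.Sequential(
    torch.nn.Linear(10, 10), torch.nn.Tanh(), torch.nn.Linear(10, 1)
)
opt = torch.optim.SGD(model.parameters(), lr=0.1)

x, y = torch.randn(32, 10), torch.randn(32, 1)
loss = torch.nn.functional.mse_loss(model(x), y)
loss.backward()
# Rescale gradients in place so their global norm is at most 1.0, then step.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
opt.step()
```

Note that this only rescales when the global norm exceeds the threshold; small gradients are left untouched.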

I thought Klambauer et al., Self-Normalizing Neural Networks, had the elaborate insights for this, including gradients.

Best regards


Regarding selu, the authors do have many elaborate insights that I don’t really understand. However they don’t use any models deeper than ~32 layers, and in my experiments the gradient doesn’t grow that much over 32 layers.

Either our experiments are somehow flawed, or we are misreading the paper. Now, I am not certain whether the paper claims that selus are not prone to exploding gradients, or whether they remain trainable regardless of any exploding gradients.

Philipp et al., The exploding gradient problem demystified, find that models tend to either be prone to exploding gradients or suffer from a collapsing domain, both of which hinder training. They suggest using either skip connections or orthogonal initialisation.

I would suggest using layerwise learning rates or an automated approach to adjusting the learning rate such as hyper-gradient descent to better cope with the differences in gradient size at different layers.
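For concreteness, layer-wise learning rates can be expressed as optimizer parameter groups in PyTorch — a sketch; the layers and the rate schedule are made up:

```python
import torch

net = torch.nn.Sequential(
    torch.nn.Linear(10, 10), torch.nn.SELU(),
    torch.nn.Linear(10, 10), torch.nn.SELU(),
    torch.nn.Linear(10, 1),
)
# One parameter group per Linear layer; shrink the lr for deeper layers
# to compensate for larger gradients there.
groups = [
    {"params": m.parameters(), "lr": 0.1 / (i + 1)}
    for i, m in enumerate(net)
    if isinstance(m, torch.nn.Linear)
]
opt = torch.optim.SGD(groups)
```

Automated schemes like hyper-gradient descent would adjust these rates during training instead of fixing them up front.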


Selu seems to stabilize well with a gain of 0.75:

in: 1.0013
out: 0.7974 grad: 0.6252
out: 0.3138 grad: 0.2894
out: 0.2418 grad: 0.2414
out: 0.2097 grad: 0.2160
out: 0.2118 grad: 0.1993
out: 0.1939 grad: 0.1985
out: 0.1909 grad: 0.2259
out: 0.1849 grad: 0.2329
out: 0.2048 grad: 0.1998
out: 0.2060 grad: 0.2007

But I’m not sure why 0.75 is magic here, or whether the scaling of the Glorot initialization achieves invariance under the layer size.

My takeaway was that there is a fixed point and I’m reasonably happy to try to find it myself. But don’t take that as advice, I am just a random clueless guy on the internet.
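For what it’s worth, the fixed-point behaviour can be checked directly: with gain 0.75 the per-layer std settles after a few layers — a sketch, with a smaller width for speed:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
with torch.no_grad():
    b = torch.randn(512, 512)
    stds = []
    for _ in range(30):
        l = torch.nn.Linear(512, 512, bias=False)
        torch.nn.init.xavier_normal_(l.weight, 0.75)
        b = F.selu(l(b))
        stds.append(b.std().item())
    # The per-layer std stops moving after a few layers: the std -> std map
    # appears to have an attracting fixed point.
    print([f"{s:.3f}" for s in stds[::10]])
```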

Best regards



I understood it as: even after updating the weights, they will stay in the domain of the contraction mapping so that the fixed point does not change.

Just a quick update: thanks to Ayrton San Joaquin, the SELU gain we found here is now in PyTorch 1.8: Add SELU Activation to calculate_gain by ajsanjoaquin · Pull Request #50664 · pytorch/pytorch · GitHub