# Calculate_gain('tanh')

I understand the results returned by `calculate_gain` for linear, relu, leaky_relu and sigmoid.

Can anyone tell me why `calculate_gain('tanh')` returns 5/3 ?
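
For reference, you can query the values directly; the comments below show what current PyTorch returns (`leaky_relu` additionally depends on its `negative_slope` argument):

```python
import math

import torch

print(torch.nn.init.calculate_gain('linear'))   # 1.0
print(torch.nn.init.calculate_gain('sigmoid'))  # 1.0
print(torch.nn.init.calculate_gain('relu'))     # sqrt(2) ~ 1.4142
print(torch.nn.init.calculate_gain('tanh'))     # 5/3 ~ 1.6667
```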


The paper *On weight initialization in deep neural networks* provides mathematical justification for using gain 1 with Tanh activation, and gain 3.6 with Sigmoid activation.

I know nothing about the paper you cite, and I haven’t looked at the papers for He’s and Glorot’s initialization, but here is Thomas’ Theory of Weight Initialization for the Theory Adverse™:

• Why do we care about initialization? We want “stability” of activation distributions in deep networks. If no “signal” arrives in layer 100, we are screwed.
• How do we measure “signal”? Let’s just take standard deviation.
• What is then a good gain? One where the standard deviation “converges” to a reasonable positive value (or stays in some region).
• Let’s not theorize, let’s just try:
```python
import torch

with torch.no_grad():
    a = torch.randn(1000, 1000)
    b = a
    for i in range(10):
        l = torch.nn.Linear(1000, 1000, bias=False)
        torch.nn.init.xavier_normal_(l.weight, torch.nn.init.calculate_gain('tanh'))
        b = l(b).tanh()
    print(f"in: {a.std().item():.4f}, out: {b.std().item():.4f}")
```
• You will get something like `in: 0.9998, out: 0.6518`.
• You will get more or less the same thing (in particular 0.65x) when you take 20 or 100 layers instead of just 10. Stability!
• You will also get the same (output, not input) if you multiply `a` by 0.5 before feeding it in.
• It doesn’t work quite as nicely when you use relu as the nonlinearity (with the relu gain).
• It will not work as well if you use 1 as gain for tanh.

The gain of 1 for tanh sounds like it is motivated by tanh having a derivative of 1 at 0. If that is the derivation, you might run into trouble with non-small variance.
I’m not sure I’ve seen many deep networks with sigmoid activations.
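
A quick numerical illustration of that point (my own check, not from the papers): near zero, tanh is essentially the identity, so the std passes through unchanged; at unit variance it squashes the tails and the std shrinks to roughly 0.63, which is the loss the 5/3 gain has to make up for.

```python
import torch

torch.manual_seed(0)
x = torch.randn(100_000)

# For small inputs tanh(x) ~ x, so the std is nearly preserved.
small = (torch.tanh(0.01 * x).std() / (0.01 * x).std()).item()
# At unit variance tanh compresses the signal noticeably.
unit = (torch.tanh(x).std() / x.std()).item()

print(f"std ratio at scale 0.01: {small:.4f}")  # ~ 1.00
print(f"std ratio at scale 1.00: {unit:.4f}")   # ~ 0.63
```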

I seem to remember watching A. Karpathy explain this in some CS231n lecture (with histograms of the activations).

Best regards

Thomas


Thank you for the data-driven perspective. I hadn’t thought of that approach.

I modified your snippet to show the mean absolute value of the gradient too.

```python
import sys

import torch
import torch.nn.functional as F

nonlinearity = sys.argv[1]  # e.g. "tanh"
gain = float(sys.argv[2])   # e.g. 5 / 3

a = torch.randn(1000, 1000, requires_grad=True)
b = a
print(f"in: {a.std().item():.4f}")
for i in range(100):
    l = torch.nn.Linear(1000, 1000, bias=False)
    torch.nn.init.xavier_normal_(l.weight, gain)
    b = getattr(F, nonlinearity)(l(b))
    if i % 10 == 0:
        print(f"out: {b.std().item():.4f}", end=" ")
        b.sum().backward(retain_graph=True)
        print(f"grad: {a.grad.abs().mean().item():.4g}")
        a.grad = None  # reset so each measurement reflects only this depth
```

A few results:

• `sigmoid` seems stable with any gain > 0, but the gradients vanish pretty fast.
The more layers you have, the higher the gain you will need.

• `tanh` seems stable with pretty much any gain > 1.
With gain 5/3 the output stabilises at ~0.65, but the gradients start to explode after around 10 layers.
Gain 1.1 works much better, giving an output std stable around 0.30 and gradients that are much more stable, though they do grow slowly.

• `softsign` with gain 1 has slowly vanishing output and gradients.
Gain > 1 reduces the vanishing, but higher values eventually cause the gradients to explode.
The higher the gain, the faster the gradients explode as you add layers.

• `relu` seems to be inherently less stable than the others, but it works OK with gain ~= sqrt(2).

• `selu` only works with gain 1 and gives output std ~= 1.00, but the grad slowly explodes after 10-20 layers.
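
On the relu point: gain sqrt(2) is exactly He initialisation. For a square layer (fan_in == fan_out), Xavier with gain sqrt(2) and `kaiming_normal_` prescribe the same weight std, which is easy to check (my own sanity check, not from the thread):

```python
import math

import torch

torch.manual_seed(0)
n = 1000
w1 = torch.empty(n, n)
w2 = torch.empty(n, n)

# Xavier: std = gain * sqrt(2 / (fan_in + fan_out)) = sqrt(2) * sqrt(2 / 2n) = sqrt(2 / n)
torch.nn.init.xavier_normal_(w1, gain=math.sqrt(2.0))
# Kaiming/He for relu: std = sqrt(2 / fan_in) = sqrt(2 / n)
torch.nn.init.kaiming_normal_(w2, nonlinearity='relu')

print(f"xavier std:  {w1.std().item():.4f}")
print(f"kaiming std: {w2.std().item():.4f}")
print(f"theory:      {math.sqrt(2 / n):.4f}")
```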


So for these two:

Then that might work. My impression was that the “usual” way to counter exploding gradients was clipping. (More prominently in RNNs, where tanh is still very common.)
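
In PyTorch, clipping is a single call between `backward()` and `optimizer.step()`; a minimal sketch with a generic toy model (not code from this thread):

```python
import torch

torch.manual_seed(0)
model = torch.nn.Sequential(
    torch.nn.Linear(10, 10),
    torch.nn.Tanh(),
    torch.nn.Linear(10, 1),
)
opt = torch.optim.SGD(model.parameters(), lr=0.1)

loss = model(torch.randn(32, 10)).pow(2).mean()
loss.backward()

# Rescale all gradients so their global L2 norm is at most 1.0;
# returns the norm as it was before clipping.
total_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
opt.step()
print(f"pre-clip grad norm: {float(total_norm):.4f}")
```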

I thought Klambauer et al., *Self-Normalizing Neural Networks*, had the elaborate insights on this, including gradients.

Best regards

Thomas

Regarding selu, the authors do have many elaborate insights that I don’t really understand. However, they don’t use any models deeper than ~32 layers, and in my experiments the gradient doesn’t grow that much over 32 layers.

Either our experiments are somehow flawed, or we are misreading the paper. Now, I am not certain whether the paper claims that selus are not prone to exploding gradients, or whether they remain trainable regardless of any exploding gradients.

Philipp et al., *The exploding gradient problem demystified*, find that models tend to either be prone to exploding gradients, or suffer from a collapsing domain, both of which hinder training. They suggest using either skip connections or orthogonal initialisation.
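
On the orthogonal suggestion: `torch.nn.init.orthogonal_` produces a weight matrix that preserves vector norms exactly, so a stack of such linear layers (without a nonlinearity) keeps the signal scale by construction. A minimal illustration of that property, in the style of the snippets above:

```python
import torch

torch.manual_seed(0)
with torch.no_grad():
    b = torch.randn(1000, 1000)
    start = b.std().item()
    for _ in range(50):
        l = torch.nn.Linear(1000, 1000, bias=False)
        torch.nn.init.orthogonal_(l.weight)  # W @ W.T == I, norm-preserving
        b = l(b)
    print(f"in: {start:.4f}, out after 50 orthogonal layers: {b.std().item():.4f}")
```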

I would suggest using layerwise learning rates or an automated approach to adjusting the learning rate such as hyper-gradient descent to better cope with the differences in gradient size at different layers.
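
Layerwise learning rates are straightforward to express with optimizer parameter groups; something along these lines (the lr schedule here is purely illustrative, and hyper-gradient descent itself is not in core PyTorch):

```python
import torch

model = torch.nn.Sequential(
    torch.nn.Linear(100, 100), torch.nn.Tanh(),
    torch.nn.Linear(100, 100), torch.nn.Tanh(),
    torch.nn.Linear(100, 10),
)

# One parameter group per Linear layer; shrink the lr with depth to
# (hypothetically) compensate for larger gradients in later layers.
groups = [
    {'params': m.parameters(), 'lr': 0.1 / (depth + 1)}
    for depth, m in enumerate(
        m for m in model if isinstance(m, torch.nn.Linear)
    )
]
opt = torch.optim.SGD(groups)
print([g['lr'] for g in opt.param_groups])  # 0.1, 0.05, 0.1/3
```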


Selu seems to stabilize well with a gain of 0.75:

```
in: 1.0013
```

But I’m not sure why 0.75 is magic here, or whether the scaling of the Glorot initialization makes it invariant to the layer size.
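
To make that concrete, here is my reconstruction of the experiment with `selu` and gain 0.75, in the style of the earlier snippets (not the exact code behind the numbers above):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
with torch.no_grad():
    a = torch.randn(1000, 1000)
    b = a
    for i in range(100):
        l = torch.nn.Linear(1000, 1000, bias=False)
        torch.nn.init.xavier_normal_(l.weight, gain=0.75)
        b = F.selu(l(b))
    print(f"in: {a.std().item():.4f}, out: {b.std().item():.4f}")
```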

My takeaway was that there is a fixed point and I’m reasonably happy to try to find it myself. But don’t take that as advice, I am just a random clueless guy on the internet.

Best regards

Thomas


I understood it as: even after updating the weights, they will stay in the domain of the contraction mapping so that the fixed point does not change.

Just a quick update: thanks to Ayrton San Joaquin, the SELU gain we found here is now in PyTorch 1.8: Add SELU Activation to calculate_gain by ajsanjoaquin · Pull Request #50664 · pytorch/pytorch · GitHub
