I understand the results returned by calculate_gain
for linear, relu, leaky_relu and sigmoid.
Can anyone tell me why calculate_gain('tanh')
returns 5/3 ?
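For concreteness, the values under discussion can be printed directly; this just queries torch.nn.init.calculate_gain, nothing else assumed:

```python
import torch

# Print the gains this thread is about.
# linear -> 1, sigmoid -> 1, tanh -> 5/3, relu -> sqrt(2)
for fn in ["linear", "sigmoid", "tanh", "relu", "leaky_relu"]:
    print(fn, torch.nn.init.calculate_gain(fn))
```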
The paper "On weight initialization in deep neural networks" provides mathematical justification for using a gain of 1 with tanh activation and a gain of 3.6 with sigmoid activation.
So, I know nothing about the paper you cite, and I haven't looked at the papers for He's and Glorot's initialization, but here is Thomas' Theory of Weight Initialization for the Theory Adverse™:
import torch

with torch.no_grad():
    a = torch.randn(1000, 1000)
    b = a
    # stack 10 tanh layers and see whether the activation std survives
    for i in range(10):
        l = torch.nn.Linear(1000, 1000, bias=False)
        torch.nn.init.xavier_normal_(l.weight, torch.nn.init.calculate_gain('tanh'))
        b = l(b).tanh()
    print(f"in: {a.std().item():.4f}, out: {b.std().item():.4f}")
in: 0.9998, out: 0.6518
The gain of 1 for tanh sounds like it is motivated by tanh having derivative 1 at 0. If that is the derivation, you might run into trouble with non-small variance.
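As a quick sanity check on that reading, tanh's slope at the origin really is 1 (tanh'(x) = 1 - tanh(x)^2, which autograd confirms):

```python
import torch

x = torch.tensor(0.0, requires_grad=True)
torch.tanh(x).backward()
print(x.grad)  # tanh'(0) = 1 - tanh(0)^2 = 1
```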
I'm not sure I've seen many deep networks with sigmoid activations.
I seem to remember watching A. Karpathy explain this in some CS231n lecture (with histograms of the activations).
Best regards
Thomas
Thank you for the data-driven perspective. I hadn't thought of that approach.
I modified your snippet to show the mean absolute value of the gradient too.
import torch
import torch.nn.functional as F
import sys

# usage: pass the activation name and the gain on the command line
a = torch.randn(1000, 1000, requires_grad=True)
b = a
print(f"in: {a.std().item():.4f}")
for i in range(100):
    l = torch.nn.Linear(1000, 1000, bias=False)
    torch.nn.init.xavier_normal_(l.weight, float(sys.argv[2]))
    b = getattr(F, sys.argv[1])(l(b))
    if i % 10 == 0:
        print(f"out: {b.std().item():.4f}", end=" ")
        a.grad = None
        b.sum().backward(retain_graph=True)
        print(f"grad: {a.grad.abs().mean().item():.4f}")
A few results:
sigmoid
seems stable with any gain > 0, but the gradients vanish pretty fast.
The more layers you have the higher the gain you will need.
tanh
seems stable with pretty much any gain > 1.
With gain 5/3 the output stabilises at ~0.65, but the gradients start to explode after around 10 layers.
Gain 1.1 works much better, giving output std stable around 0.30 and much more stable gradients, though they do grow slowly.
softsign
with gain 1 has slowly vanishing output and gradients
Gain > 1 reduces the vanishing, but higher values eventually cause the gradients to explode.
The higher the gain, the faster the gradients explode as you add layers.
relu
seems to be inherently less stable than the others, but it works OK with gain ~= sqrt(2)
selu
only works with gain 1 and gives output std ~= 1.00, but the grad slowly explodes after 10-20 layers
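On the relu point above, the sqrt(2) gain has a simple statistical reading: for zero-mean symmetric input, relu zeroes half the mass, so the second moment halves, and a gain of sqrt(2) restores it. A quick Monte Carlo sketch (sample size picked arbitrarily):

```python
import torch

torch.manual_seed(0)
x = torch.randn(1_000_000)
# For zero-mean symmetric x, E[relu(x)^2] = E[x^2] / 2,
# which is what He et al.'s gain of sqrt(2) compensates for.
ratio = (x.relu().pow(2).mean() / x.pow(2).mean()).item()
print(ratio)  # close to 0.5
```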
So for these two:
Then that might work. My impression was that the "usual" way to counter exploding gradients was clipping. (More prominently in RNNs, where tanh still is very common, too.)
I thought Klambauer et al., Self-Normalizing Neural Networks, had the elaborate insights for this, including gradients.
Best regards
Thomas
Regarding selu, the authors do have many elaborate insights that I don't really understand. However, they don't use any models deeper than ~32 layers, and in my experiments the gradient doesn't grow that much over 32 layers.
Either our experiments are somehow flawed, or we are misreading the paper. Now, I am not certain whether the paper claims that selus are not prone to exploding gradients, or whether they remain trainable regardless of any exploding gradients.
Philipp et al., The exploding gradient problem demystified, find that models tend either to be prone to exploding gradients or to suffer from a collapsing domain, both of which hinder training. They suggest using either skip connections or orthogonal initialisation.
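The orthogonal-initialisation point is easy to see in the linear case (no activation; this sketch deliberately drops the nonlinearity, so it only illustrates the norm-preserving property, not a full network):

```python
import torch

torch.manual_seed(0)
with torch.no_grad():
    b = torch.randn(1000, 1000)
    for _ in range(50):
        l = torch.nn.Linear(1000, 1000, bias=False)
        torch.nn.init.orthogonal_(l.weight)
        b = l(b)  # an orthogonal map preserves norms, so the std stays put
    print(f"{b.std().item():.4f}")  # still ~1.0 after 50 layers
```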
I would suggest using layerwise learning rates or an automated approach to adjusting the learning rate such as hyper-gradient descent to better cope with the differences in gradient size at different layers.
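The layerwise-learning-rate idea maps directly onto optimizer param groups; here is a minimal sketch (the three-layer model and the 0.5 decay factor per layer are made up for illustration):

```python
import torch

# Hypothetical 3-layer model; give deeper layers smaller learning rates
# to compensate for gradients that grow with depth.
model = torch.nn.Sequential(
    torch.nn.Linear(10, 10),
    torch.nn.Linear(10, 10),
    torch.nn.Linear(10, 10),
)
groups = [
    {"params": layer.parameters(), "lr": 0.1 * 0.5 ** i}
    for i, layer in enumerate(model)
]
opt = torch.optim.SGD(groups, lr=0.1)
for g in opt.param_groups:
    print(g["lr"])  # 0.1, 0.05, 0.025
```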
Selu seems to stabilize well with a gain of 0.75:
in: 1.0013
out: 0.7974 grad: 0.6252
out: 0.3138 grad: 0.2894
out: 0.2418 grad: 0.2414
out: 0.2097 grad: 0.2160
out: 0.2118 grad: 0.1993
out: 0.1939 grad: 0.1985
out: 0.1909 grad: 0.2259
out: 0.1849 grad: 0.2329
out: 0.2048 grad: 0.1998
out: 0.2060 grad: 0.2007
But I'm not sure why 0.75 is magic here, or whether the scaling of the Glorot initialization achieves invariance under the layer size.
My takeaway was that there is a fixed point and I'm reasonably happy to try to find it myself. But don't take that as advice, I am just a random clueless guy on the internet.
Best regards
Thomas
I understood it as: even after updating the weights, they will stay in the domain of the contraction mapping so that the fixed point does not change.
Just a quick update: thanks to Ayrton San Joaquin, the SELU gain we found here is now in PyTorch 1.8: Add SELU Activation to calculate_gain by ajsanjoaquin · Pull Request #50664 · pytorch/pytorch · GitHub