# How do I find the standard deviation of activations?

A good standard deviation for the activations is on the order of 0.5 to 2.0. Values significantly outside this range may indicate one of the problems mentioned above.

How can I find the std of my activations using PyTorch?

I can find the gradients like this, and they look a bit small to me, so I want to investigate further. Or do they really look that small?

```python
for name, param in model.named_parameters():
    print("Layer name:", name)
    print("Gradient:", param.grad)
```

Output:

```
Layer 1
Layer 4
Layer out
```
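To put a number on "small", one option (my own sketch, not part of the original post) is to print summary statistics of each parameter's gradient after a backward pass, rather than the raw tensors. The toy model and loss here are placeholders:

```python
import torch
import torch.nn as nn

# Toy model and a single backward pass, just to have gradients to inspect.
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 4))
out = model(torch.randn(32, 8))
out.mean().backward()

# Summary statistics are easier to eyeball than full gradient tensors.
for name, param in model.named_parameters():
    g = param.grad
    print(f"{name}: mean abs {g.abs().mean():.3e}, std {g.std():.3e}")
```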

You could use forward hooks to print the standard deviation of the layer outputs:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MyModel(nn.Module):
    def __init__(self):
        super(MyModel, self).__init__()
        self.conv1 = nn.Conv2d(3, 6, 3, 1, 1)
        self.pool1 = nn.MaxPool2d(2)
        self.conv2 = nn.Conv2d(6, 1, 3, 1, 1)
        self.pool2 = nn.MaxPool2d(2)
        self.fc = nn.Linear(6*6, 10)

    def forward(self, x):
        x = F.relu(self.conv1(x))
        x = self.pool1(x)
        x = F.relu(self.conv2(x))
        x = self.pool2(x)
        x = x.view(x.size(0), -1)
        x = self.fc(x)
        return x

model = MyModel()
model.conv1.register_forward_hook(lambda m, inp, out: print(out.std()))
model.conv2.register_forward_hook(lambda m, inp, out: print(out.std()))
model.fc.register_forward_hook(lambda m, inp, out: print(out.std()))

x = torch.randn(1, 3, 24, 24)
output = model(x)
```
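A small variation on the same idea (my own sketch, not from the answer above): collect the stds into a dict keyed by module name, and keep the hook handles so the hooks can be removed once you are done debugging. `register_forward_hook` returns a handle with a `remove()` method:

```python
import torch
import torch.nn as nn

# Any model works; a small Sequential keeps the example short.
model = nn.Sequential(nn.Linear(10, 20), nn.ReLU(), nn.Linear(20, 5))

stats = {}

def make_hook(name):
    def hook(module, inputs, output):
        stats[name] = output.std().item()
    return hook

# Register on leaf modules only, and keep the returned handles.
handles = [m.register_forward_hook(make_hook(n))
           for n, m in model.named_modules()
           if len(list(m.children())) == 0]

model(torch.randn(4, 10))
print(stats)

# Remove the hooks once you're done inspecting.
for h in handles:
    h.remove()
```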

Cool, thanks! Just what I was looking for.

Any comments on my gradients? Or is this a case-by-case thing where there are no right or wrong gradients?


Stanford’s CS231n states that the ratio of update magnitude to weight magnitude should be roughly `1e-3` (source).
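The heuristic is from CS231n; the code below is my own illustration of how that ratio could be measured in PyTorch, using a placeholder model and loss. It snapshots the weights, takes one optimizer step, and compares the norm of the update to the norm of the weights:

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 10)
opt = torch.optim.SGD(model.parameters(), lr=1e-2)

# Dummy loss just to produce gradients.
loss = model(torch.randn(16, 10)).pow(2).mean()
loss.backward()

# Snapshot weights, take one step, then compare update size to weight size.
before = {n: p.detach().clone() for n, p in model.named_parameters()}
opt.step()

for name, p in model.named_parameters():
    update = (p.detach() - before[name]).norm()
    ratio = (update / before[name].norm()).item()
    print(f"{name}: update/weight ratio {ratio:.3e}")
```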