# How do I find the standard deviation of activations?

A good standard deviation for the activations is on the order of 0.5 to 2.0. Values significantly outside this range may indicate one of the problems mentioned above.

How can I find the std of my activations using PyTorch?

I can find the gradients like this, and they look a bit small to me, so I want to investigate further. Or do they really look that small?

```python
for name, param in model.named_parameters():
    print("Layer name:", name)
    print("Gradient:", param.grad)
```

Output:

```
Layer 1
Layer 4
Layer out
```
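To put a number on "small", one option (my own sketch, not part of the original post) is to print summary statistics of each parameter's gradient after a backward pass, rather than the raw tensors. The toy model and loss here are placeholders:

```python
import torch
import torch.nn as nn

# Toy model and a single backward pass, just to have gradients to inspect.
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 4))
out = model(torch.randn(32, 8))
out.mean().backward()

# Summary statistics are easier to eyeball than full gradient tensors.
for name, param in model.named_parameters():
    g = param.grad
    print(f"{name}: mean abs {g.abs().mean():.3e}, std {g.std():.3e}")
```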

You could use forward hooks to print the standard deviation of the layer outputs:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MyModel(nn.Module):
    def __init__(self):
        super(MyModel, self).__init__()
        self.conv1 = nn.Conv2d(3, 6, 3, 1, 1)
        self.pool1 = nn.MaxPool2d(2)
        self.conv2 = nn.Conv2d(6, 1, 3, 1, 1)
        self.pool2 = nn.MaxPool2d(2)
        self.fc = nn.Linear(6*6, 10)

    def forward(self, x):
        x = F.relu(self.conv1(x))
        x = self.pool1(x)
        x = F.relu(self.conv2(x))
        x = self.pool2(x)
        x = x.view(x.size(0), -1)
        x = self.fc(x)
        return x

model = MyModel()
model.conv1.register_forward_hook(lambda m, inp, out: print(out.std()))
model.conv2.register_forward_hook(lambda m, inp, out: print(out.std()))
model.fc.register_forward_hook(lambda m, inp, out: print(out.std()))

x = torch.randn(1, 3, 24, 24)
output = model(x)
```
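A small variation on the same idea (my own sketch, not from the answer above): collect the stds into a dict keyed by module name, and keep the hook handles so the hooks can be removed once you are done debugging. `register_forward_hook` returns a handle with a `remove()` method:

```python
import torch
import torch.nn as nn

# Any model works; a small Sequential keeps the example short.
model = nn.Sequential(nn.Linear(10, 20), nn.ReLU(), nn.Linear(20, 5))

stats = {}

def make_hook(name):
    def hook(module, inputs, output):
        stats[name] = output.std().item()
    return hook

# Register on leaf modules only, and keep the returned handles.
handles = [m.register_forward_hook(make_hook(n))
           for n, m in model.named_modules()
           if len(list(m.children())) == 0]

model(torch.randn(4, 10))
print(stats)

# Remove the hooks once you're done inspecting.
for h in handles:
    h.remove()
```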

Cool, thanks! Just what I was looking for.

Any comments on my gradients? Or is this a case-by-case thing where there are no right or wrong gradients?


Stanford’s CS231n states that the ratio of update magnitude to weight magnitude should be roughly `1e-3` (source).
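The heuristic is from CS231n; the code below is my own illustration of how that ratio could be measured in PyTorch, using a placeholder model and loss. It snapshots the weights, takes one optimizer step, and compares the norm of the update to the norm of the weights:

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 10)
opt = torch.optim.SGD(model.parameters(), lr=1e-2)

# Dummy loss just to produce gradients.
loss = model(torch.randn(16, 10)).pow(2).mean()
loss.backward()

# Snapshot weights, take one step, then compare update size to weight size.
before = {n: p.detach().clone() for n, p in model.named_parameters()}
opt.step()

for name, p in model.named_parameters():
    update = (p.detach() - before[name]).norm()
    ratio = (update / before[name].norm()).item()
    print(f"{name}: update/weight ratio {ratio:.3e}")
```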