I have a CNN that looks as follows:
VGG(
  (features): Sequential(
    (0): Conv2d(3, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (1): ReLU(inplace=True)
    (2): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (3): Conv2d(64, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (4): ReLU(inplace=True)
    (5): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (6): Conv2d(128, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (7): ReLU(inplace=True)
    (8): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (9): ReLU(inplace=True)
    (10): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (11): Conv2d(256, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (12): ReLU(inplace=True)
    (13): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (14): ReLU(inplace=True)
    (15): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (16): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (17): ReLU(inplace=True)
    (18): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (19): ReLU(inplace=True)
    (20): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (21): AvgPool2d(kernel_size=1, stride=1, padding=0)
  )
  (classifier): Linear(in_features=512, out_features=10, bias=True)
)
My working assumption is that when I multiply every weight and every bias by two, the resulting logits should have the same distribution as before, just scaled up by the factor (2 ** #layers). Specifically, every individual operation (Conv2d, ReLU, MaxPool2d, AvgPool2d, Linear) should change nothing about the resulting activation distribution except its scale (if anything). My reasoning per operation is below, followed by a small sanity check:
Conv2d: Since it’s just a weighted sum plus the bias, the resulting activations should be twice as large
ReLU: Multiplying by a positive factor doesn't flip any signs, so the same activations pass through, just scaled (and zeros stay zero)
MaxPool2d: Scaling doesn't change which activation is the maximum, so the pooled output is just scaled as well
AvgPool2d: The average of scaled-up activations is also just scaled up
Linear: Same as Conv2d
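These per-operation claims can be checked in isolation with something like the following sketch (random tensors with arbitrary shapes, not my actual model or data):

import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(1, 3, 8, 8)   # fixed input
w = torch.randn(4, 3, 3, 3)   # conv weight
b = torch.randn(4)            # conv bias

# Conv2d: doubling weight and bias doubles the output for the same input
assert torch.allclose(F.conv2d(x, 2 * w, 2 * b, padding=1),
                      2 * F.conv2d(x, w, b, padding=1), atol=1e-5)

# Linear behaves the same way
v, wl, bl = torch.randn(2, 4), torch.randn(10, 4), torch.randn(10)
assert torch.allclose(F.linear(v, 2 * wl, 2 * bl), 2 * F.linear(v, wl, bl), atol=1e-5)

# ReLU, MaxPool2d and AvgPool2d commute with multiplication by a positive scalar
y = torch.randn(1, 4, 8, 8)
assert torch.allclose(F.relu(2 * y), 2 * F.relu(y))
assert torch.allclose(F.max_pool2d(2 * y, kernel_size=2), 2 * F.max_pool2d(y, kernel_size=2))
assert torch.allclose(F.avg_pool2d(2 * y, kernel_size=1), 2 * F.avg_pool2d(y, kernel_size=1))

Each of these checks passes in float32, which is why I expect the full network to behave the same way.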
However, when I actually multiply every weight and bias in my network by two (a sketch of the scaling step is at the end of this post), the distribution of the logits changes. For example, for an input where my CNN originally gave me these logits:
[ -5.5469, -1.3721, -1.7734, 2.9941, 1.6348, 1.5049, 13.4219, -3.9648, -3.7793, -3.0293]
after multiplying each weight and bias by two, I get the following logits:
[ -628.0000, 409.0000, -346.0000, 594.5000, -304.0000, 144.8750, 1746.0000, -708.0000, -622.0000, -276.0000]
Not only is the scale off, but even the signs of some of the logits have changed. Am I running into a numerical/overflow issue in PyTorch, or is there an error in my thought process?
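For completeness, the doubling step itself amounts to roughly the following (with model being the VGG instance printed above; the loading and evaluation code is omitted):

import torch

# model is assumed to be the VGG instance printed above
with torch.no_grad():
    for param in model.parameters():
        param.mul_(2)  # doubles every weight and bias in place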