I downloaded the pretrained PyTorch VGG16-with-batch-norm model (vgg16_bn) and inspected its state dictionary. Instead of the single extra layer after each conv layer (the batch normalization), I see four extra entries per conv layer. Here is the output:
> sd = torch.load('vgg16_bn-6c64b313.pth')
> for k in sd:
> ... print sd[k].shape
> ...
> (64L, 3L, 3L, 3L)
> (64L,)
> (64L,)
> (64L,)
> (64L,)
> (64L,)
> (64L, 64L, 3L, 3L)
> (64L,)
> (64L,)
> (64L,)
> (64L,)
> (64L,)
> (128L, 64L, 3L, 3L)
> (128L,)
> (128L,)
> (128L,)
> (128L,)
> (128L,)
> (128L, 128L, 3L, 3L)
> (128L,)
> (128L,)
> (128L,)
> (128L,)
> (128L,)
> (256L, 128L, 3L, 3L)
> (256L,)
> (256L,)
> (256L,)
> (256L,)
> (256L,)
> (256L, 256L, 3L, 3L)
> (256L,)
> (256L,)
> (256L,)
> (256L,)
> (256L,)
> (256L, 256L, 3L, 3L)
> (256L,)
> (256L,)
> (256L,)
> (256L,)
> (256L,)
> (512L, 256L, 3L, 3L)
> (512L,)
> (512L,)
> (512L,)
> (512L,)
> (512L,)
> (512L, 512L, 3L, 3L)
> (512L,)
> (512L,)
> (512L,)
> (512L,)
> (512L,)
> (512L, 512L, 3L, 3L)
> (512L,)
> (512L,)
> (512L,)
> (512L,)
> (512L,)
> (512L, 512L, 3L, 3L)
> (512L,)
> (512L,)
> (512L,)
> (512L,)
> (512L,)
> (512L, 512L, 3L, 3L)
> (512L,)
> (512L,)
> (512L,)
> (512L,)
> (512L,)
> (512L, 512L, 3L, 3L)
> (512L,)
> (512L,)
> (512L,)
> (512L,)
> (512L,)
> (4096L, 25088L)
> (4096L,)
> (4096L, 4096L)
> (4096L,)
> (1000L, 4096L)
> (1000L,)
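For context, these are exactly the tensors that the torchvision model below expects. A minimal loading sketch, assuming the weights file is in the working directory (recent PyTorch versions back-fill the num_batches_tracked buffer that this older checkpoint lacks; pass strict=False to load_state_dict if your version complains):

import torch
import torchvision

# Build the architecture with random weights, then load the downloaded
# checkpoint by hand; the keys and shapes line up one-to-one.
vgg = torchvision.models.vgg16_bn()
vgg.load_state_dict(torch.load('vgg16_bn-6c64b313.pth'))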
The torchvision model itself shows what I’d expect: blocks of conv-BN-ReLU modules:
import torchvision
vgg = torchvision.models.vgg16_bn(False)
vgg
VGG(
  (features): Sequential(
    (0): Conv2d (3, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True)
    (2): ReLU(inplace)
    (3): Conv2d (64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (4): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True)
    (5): ReLU(inplace)
    (6): MaxPool2d(kernel_size=(2, 2), stride=(2, 2), dilation=(1, 1))
    (7): Conv2d (64, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (8): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True)
    (9): ReLU(inplace)
    (10): Conv2d (128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (11): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True)
    (12): ReLU(inplace)
    (13): MaxPool2d(kernel_size=(2, 2), stride=(2, 2), dilation=(1, 1))
    (14): Conv2d (128, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (15): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True)
    (16): ReLU(inplace)
    (17): Conv2d (256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (18): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True)
    (19): ReLU(inplace)
    (20): Conv2d (256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (21): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True)
    (22): ReLU(inplace)
    (23): MaxPool2d(kernel_size=(2, 2), stride=(2, 2), dilation=(1, 1))
    (24): Conv2d (256, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (25): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True)
    (26): ReLU(inplace)
    (27): Conv2d (512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (28): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True)
    (29): ReLU(inplace)
    (30): Conv2d (512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (31): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True)
    (32): ReLU(inplace)
    (33): MaxPool2d(kernel_size=(2, 2), stride=(2, 2), dilation=(1, 1))
    (34): Conv2d (512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (35): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True)
    (36): ReLU(inplace)
    (37): Conv2d (512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (38): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True)
    (39): ReLU(inplace)
    (40): Conv2d (512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (41): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True)
    (42): ReLU(inplace)
    (43): MaxPool2d(kernel_size=(2, 2), stride=(2, 2), dilation=(1, 1))
  )
  (classifier): Sequential(
    (0): Linear(in_features=25088, out_features=4096)
    (1): ReLU(inplace)
    (2): Dropout(p=0.5)
    (3): Linear(in_features=4096, out_features=4096)
    (4): ReLU(inplace)
    (5): Dropout(p=0.5)
    (6): Linear(in_features=4096, out_features=1000)
  )
)
What are these extra entries I’m seeing in the pre-trained model’s state dictionary?
EDIT: After realizing I could just look at the keys (duh!), I see they are the running mean and variance estimates that batch norm maintains. That being said, why are there four extra and not two?
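For anyone following along, printing the keys next to the shapes makes the grouping explicit. A minimal Python 3 sketch of the same inspection (same filename as above; the numeric key prefixes match the layer indices in the model printout):

import torch

sd = torch.load('vgg16_bn-6c64b313.pth')
# Print each parameter/buffer name next to its shape.
for k, v in sd.items():
    print(k, tuple(v.shape))

# First conv block, annotated:
# features.0.weight (64, 3, 3, 3)   conv kernel
# features.0.bias (64,)             conv bias
# features.1.weight (64,)           BN scale (gamma), learnable
# features.1.bias (64,)             BN shift (beta), learnable
# features.1.running_mean (64,)     BN running mean, a buffer
# features.1.running_var (64,)      BN running variance, a buffer

So each BatchNorm2d contributes four per-channel tensors: the two learnable affine parameters (gamma and beta) plus the two running statistics used at eval time. Inspecting a fresh torch.nn.BatchNorm2d(64).state_dict() shows the same four (newer PyTorch versions additionally store a num_batches_tracked buffer, which this older checkpoint predates).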