What are the extra layers in the vgg16_bn pre-trained model?

I downloaded the PyTorch VGG16 model with batch norm (vgg16_bn) and inspected its state dictionary. Instead of the one extra layer after each conv layer (for the batch normalization), I see 4 extra layers. Here is the output:

> import torch
> sd = torch.load('vgg16_bn-6c64b313.pth')
> for k in sd:
> ...   print(sd[k].shape)
> ... 
> (64L, 3L, 3L, 3L)
> (64L,)
> (64L,)
> (64L,)
> (64L,)
> (64L,)
> (64L, 64L, 3L, 3L)
> (64L,)
> (64L,)
> (64L,)
> (64L,)
> (64L,)
> (128L, 64L, 3L, 3L)
> (128L,)
> (128L,)
> (128L,)
> (128L,)
> (128L,)
> (128L, 128L, 3L, 3L)
> (128L,)
> (128L,)
> (128L,)
> (128L,)
> (128L,)
> (256L, 128L, 3L, 3L)
> (256L,)
> (256L,)
> (256L,)
> (256L,)
> (256L,)
> (256L, 256L, 3L, 3L)
> (256L,)
> (256L,)
> (256L,)
> (256L,)
> (256L,)
> (256L, 256L, 3L, 3L)
> (256L,)
> (256L,)
> (256L,)
> (256L,)
> (256L,)
> (512L, 256L, 3L, 3L)
> (512L,)
> (512L,)
> (512L,)
> (512L,)
> (512L,)
> (512L, 512L, 3L, 3L)
> (512L,)
> (512L,)
> (512L,)
> (512L,)
> (512L,)
> (512L, 512L, 3L, 3L)
> (512L,)
> (512L,)
> (512L,)
> (512L,)
> (512L,)
> (512L, 512L, 3L, 3L)
> (512L,)
> (512L,)
> (512L,)
> (512L,)
> (512L,)
> (512L, 512L, 3L, 3L)
> (512L,)
> (512L,)
> (512L,)
> (512L,)
> (512L,)
> (512L, 512L, 3L, 3L)
> (512L,)
> (512L,)
> (512L,)
> (512L,)
> (512L,)
> (4096L, 25088L)
> (4096L,)
> (4096L, 4096L)
> (4096L,)
> (1000L, 4096L)
> (1000L,)

The torchvision model itself shows what I’d expect: blocks of Conv-BatchNorm-ReLU modules:

import torchvision

vgg = torchvision.models.vgg16_bn(pretrained=False)
vgg
VGG(
  (features): Sequential(
    (0): Conv2d (3, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True)
    (2): ReLU(inplace)
    (3): Conv2d (64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (4): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True)
    (5): ReLU(inplace)
    (6): MaxPool2d(kernel_size=(2, 2), stride=(2, 2), dilation=(1, 1))
    (7): Conv2d (64, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (8): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True)
    (9): ReLU(inplace)
    (10): Conv2d (128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (11): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True)
    (12): ReLU(inplace)
    (13): MaxPool2d(kernel_size=(2, 2), stride=(2, 2), dilation=(1, 1))
    (14): Conv2d (128, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (15): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True)
    (16): ReLU(inplace)
    (17): Conv2d (256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (18): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True)
    (19): ReLU(inplace)
    (20): Conv2d (256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (21): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True)
    (22): ReLU(inplace)
    (23): MaxPool2d(kernel_size=(2, 2), stride=(2, 2), dilation=(1, 1))
    (24): Conv2d (256, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (25): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True)
    (26): ReLU(inplace)
    (27): Conv2d (512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (28): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True)
    (29): ReLU(inplace)
    (30): Conv2d (512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (31): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True)
    (32): ReLU(inplace)
    (33): MaxPool2d(kernel_size=(2, 2), stride=(2, 2), dilation=(1, 1))
    (34): Conv2d (512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (35): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True)
    (36): ReLU(inplace)
    (37): Conv2d (512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (38): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True)
    (39): ReLU(inplace)
    (40): Conv2d (512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (41): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True)
    (42): ReLU(inplace)
    (43): MaxPool2d(kernel_size=(2, 2), stride=(2, 2), dilation=(1, 1))
  )
  (classifier): Sequential(
    (0): Linear(in_features=25088, out_features=4096)
    (1): ReLU(inplace)
    (2): Dropout(p=0.5)
    (3): Linear(in_features=4096, out_features=4096)
    (4): ReLU(inplace)
    (5): Dropout(p=0.5)
    (6): Linear(in_features=4096, out_features=1000)
  )
)

What are the extra layers I’m seeing in the pre-trained model?

EDIT: After realizing I could just look at the keys (duh!), I see they are the running mean and variance statistics. That being said, why are there 4 extra and not 2?
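For anyone following along, here is a minimal Python 3 sketch of the same inspection that prints the key names next to the shapes (using the same checkpoint file as above):

import torch

# The key names make the layout obvious: features.0.* belongs to the first
# Conv2d, features.1.* to the BatchNorm2d that follows it, and so on.
sd = torch.load('vgg16_bn-6c64b313.pth')
for k, v in sd.items():
    print(k, tuple(v.shape))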

batchnorm weight, batchnorm bias, batchnorm running_mean, batchnorm running_var
(BatchNorm2d applies an affine transform by default, so it carries a learnable weight and bias in addition to the running statistics it tracks)
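As a quick sketch of where those four tensors live (a freshly constructed BatchNorm2d, not the VGG weights): the scale and shift are learnable parameters, while the running statistics are buffers that get saved into the state dict alongside them.

import torch.nn as nn

bn = nn.BatchNorm2d(64)  # affine=True by default -> learnable weight (gamma) and bias (beta)

# Learnable parameters
print([name for name, _ in bn.named_parameters()])  # ['weight', 'bias']

# Non-learnable buffers, also stored in the state dict
# ['running_mean', 'running_var'] (newer PyTorch also adds 'num_batches_tracked')
print([name for name, _ in bn.named_buffers()])

That num_batches_tracked buffer is why a checkpoint saved with a newer PyTorch may show yet another entry per batch-norm layer.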

Then why is it 5 extra weights, not 4? As far as I understand, we don’t need a bias in the conv layer if a batch normalization layer follows immediately after the convolution, because the conv bias simply gets absorbed into the mean that batch norm subtracts (it shows up in running_mean and cancels out).

5 extra tensors = 4 batchnorm tensors (weight, bias, running_mean, running_var) + 1 conv bias.

Yes, you can disable the conv bias by passing Conv2d(…, bias=False) in the constructor.
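To illustrate the absorption argument with a toy example (the module names here are mine, not from the thread): in training mode batch norm subtracts the batch mean, so a constant bias added by the conv is removed again and the output is identical whether the bias is there or not.

import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(4, 3, 8, 8)

conv = nn.Conv2d(3, 16, kernel_size=3, padding=1, bias=True)
bn = nn.BatchNorm2d(16)  # training mode: normalizes with the batch statistics

y_with_bias = bn(conv(x))

# Zero the conv bias and run again: the batch mean shifts by exactly that bias,
# so the normalized output is unchanged.
with torch.no_grad():
    conv.bias.zero_()
y_without_bias = bn(conv(x))

print(torch.allclose(y_with_bias, y_without_bias, atol=1e-5))  # True

This cancellation holds for the batch statistics used during training; passing bias=False simply avoids storing the redundant parameter in the first place.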