Transfer learning usage with different input size

Thank you so much for the swift response. Here is one more question pertaining to VGG16. I read through your comments and realized it may be wise to study each layer carefully, since one may perform model surgery on the layers, especially when fine-tuning. I hard-coded a VGG16 from scratch and compared it with the "proper version" in the torchvision source code. I thought there wasn't any difference besides the fact that the source code uses Sequential and is neater (I would tidy mine up). However, when using the exact same seed (with a fairly robust seeding method), I get different results from my hard-coded model vs. the one from the source code, both with pretrained=False.

import torch

class VGG16(torch.nn.Module):
    def __init__(self, init_weights=True):
        super(VGG16, self).__init__()
        self.conv1 = torch.nn.Conv2d(in_channels=3, out_channels=64, kernel_size=(3,3), stride=(1,1), padding=(1,1))
        self.conv2 = torch.nn.Conv2d(in_channels=64, out_channels=64, kernel_size=(3,3), stride=(1,1), padding=(1,1))
        self.conv3 = torch.nn.Conv2d(in_channels=64, out_channels=128, kernel_size=(3,3), stride=(1,1), padding=(1,1))
        self.conv4 = torch.nn.Conv2d(in_channels=128, out_channels=128, kernel_size=(3,3), stride=(1,1), padding=(1,1))
        self.conv5 = torch.nn.Conv2d(in_channels=128, out_channels=256, kernel_size=(3,3), stride=(1,1), padding=(1,1))
        self.conv6 = torch.nn.Conv2d(in_channels=256, out_channels=256, kernel_size=(3,3), stride=(1,1), padding=(1,1))
        self.conv7 = torch.nn.Conv2d(in_channels=256, out_channels=256, kernel_size=(3,3), stride=(1,1), padding=(1,1))
        self.conv8 = torch.nn.Conv2d(in_channels=256, out_channels=512, kernel_size=(3,3), stride=(1,1), padding=(1,1))
        self.conv9 = torch.nn.Conv2d(in_channels=512, out_channels=512, kernel_size=(3,3), stride=(1,1), padding=(1,1))
        self.conv10 = torch.nn.Conv2d(in_channels=512, out_channels=512, kernel_size=(3,3), stride=(1,1), padding=(1,1))
        self.conv11 = torch.nn.Conv2d(in_channels=512, out_channels=512, kernel_size=(3,3), stride=(1,1), padding=(1,1))
        self.conv12 = torch.nn.Conv2d(in_channels=512, out_channels=512, kernel_size=(3,3), stride=(1,1), padding=(1,1))
        self.conv13 = torch.nn.Conv2d(in_channels=512, out_channels=512, kernel_size=(3,3), stride=(1,1), padding=(1,1))
        # Sequential Linear (fully-connected) Layers with affine operations y=Wx+b
        self.fc1 = torch.nn.Linear(in_features=25088,out_features=4096, bias=True)
        self.fc2 = torch.nn.Linear(in_features=4096,out_features=4096, bias=True)
        # last layer before softmax - usually called include_top in Keras.
        self.fc3 = torch.nn.Linear(in_features=4096,out_features=1000, bias=True)
        # completed 16 layers, hence the name VGG16
        self.dropout = torch.nn.Dropout(p=0.5, inplace=False)
        self.activation = torch.nn.ReLU(inplace=True)
        self.avgpool = torch.nn.AdaptiveAvgPool2d((7, 7))
        
        if init_weights:
            self._initialize_weights()
        
    def forward(self, input_neurons: torch.Tensor) -> torch.Tensor:
        input_neurons = self.activation(self.conv1(input_neurons))
        input_neurons = self.activation(self.conv2(input_neurons))
        # note here we are using maxpooling with stride 2 on conv2 layer before we proceed to conv3
        input_neurons = torch.nn.MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)(self.conv2(input_neurons))
        input_neurons = self.activation(self.conv3(input_neurons))
        input_neurons = self.activation(self.conv4(input_neurons))
        input_neurons = torch.nn.MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)(self.conv4(input_neurons))
        input_neurons = self.activation(self.conv5(input_neurons))
        input_neurons = self.activation(self.conv6(input_neurons))
        input_neurons = self.activation(self.conv7(input_neurons))
        input_neurons = torch.nn.MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)(self.conv7(input_neurons))
        input_neurons = self.activation(self.conv8(input_neurons))
        input_neurons = self.activation(self.conv9(input_neurons))
        input_neurons = self.activation(self.conv10(input_neurons))
        input_neurons = torch.nn.MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)(self.conv10(input_neurons))
        input_neurons = self.activation(self.conv11(input_neurons))
        input_neurons = self.activation(self.conv12(input_neurons))
        input_neurons = self.activation(self.conv13(input_neurons))
        input_neurons = torch.nn.MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)(self.conv13(input_neurons))
        # Adaptive Layer
        input_neurons = self.avgpool(input_neurons)
        # Flatten
        input_neurons = torch.flatten(input_neurons,1)
        # or equivalently:
        # input_neurons = input_neurons.view(input_neurons.size(0), -1)
        # Fully Connected Layers Below
        input_neurons = self.dropout(self.activation(self.fc1(input_neurons)))
        input_neurons = self.dropout(self.activation(self.fc2(input_neurons)))
        input_neurons = self.fc3(input_neurons)
        return input_neurons
        
    def _initialize_weights(self) -> None:
        for m in self.modules():
            if isinstance(m, torch.nn.Conv2d):
                torch.nn.init.kaiming_normal_(m.weight, mode='fan_out', nonlinearity='relu')
                if m.bias is not None:
                    torch.nn.init.constant_(m.bias, 0)
            elif isinstance(m, torch.nn.BatchNorm2d):
                torch.nn.init.constant_(m.weight, 1)
                torch.nn.init.constant_(m.bias, 0)
            elif isinstance(m, torch.nn.Linear):
                torch.nn.init.normal_(m.weight, 0, 0.01)
                torch.nn.init.constant_(m.bias, 0)

Then I defined a dummy tensor:
rand_tensor = torch.ones(8, 3, 64, 64, dtype=torch.float).to(device)
and compared both versions as follows:

from torchvision import models

vgg16hongnan = vgg(arch='vgg16', pretrained=False, progress=True)
vgg16hongnan = vgg16hongnan.to(device)

vgg16v1 = models.vgg16(pretrained=False)
vgg16v1 = vgg16v1.to(device)

vgg16hongnan(rand_tensor) gives a different answer from vgg16v1(rand_tensor). I reckon I made some layer errors in between… but I checked a few times and thought it was fine. PS: I made sure to run both on the same GPU and cleared the cache every time to ensure deterministic results.

I wouldn’t recommend using the seeding approach, as you would have to be very careful to make exactly the same calls into the pseudorandom number generator.
An easier approach would be to load the state_dict from one model to the other and compare the outputs. Since your architectures might differ a bit, you might need to manipulate the keys of the state_dict to make them match.
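For example, something along these lines should work (an untested sketch that assumes both models register the same layers in the same order, so the state_dict entries line up positionally):

import torch
from torchvision import models

# Copy the torchvision VGG16 parameters into the custom model by matching
# state_dict entries positionally (the key names differ, e.g.
# "features.0.weight" vs. "conv1.weight", but the order and shapes line up).
reference = models.vgg16(pretrained=False)
custom = VGG16()  # the hard-coded model defined above

ref_state = reference.state_dict()
custom_state = custom.state_dict()
assert len(ref_state) == len(custom_state)

remapped = {c_key: r_val for (c_key, _), (_, r_val)
            in zip(custom_state.items(), ref_state.items())}
custom.load_state_dict(remapped)

reference.eval()
custom.eval()
x = torch.ones(8, 3, 64, 64)
with torch.no_grad():
    # True only if the two architectures really are identical
    print(torch.allclose(reference(x), custom(x)))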


Thank you for a great suggestion.
I have a quick question on the normalization part if we use the idea of adding a Conv2d layer at the beginning to map 1 channel to 3 channels. Pre-trained models require normalizing the data with the mean and std values of the dataset they were pre-trained on. For example, VGG16 trained on the ImageNet dataset requires us to normalize with mean=[0.485, 0.456, 0.406] and std=[0.229, 0.224, 0.225]. I assume we could take an average over the 3 channels, but I’m not sure if that’s okay.
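To make the idea concrete, this is roughly what I had in mind (just a sketch of the averaging idea, not something I’m sure is correct):

from torchvision import transforms

# Average the ImageNet statistics into a single value and normalize the
# 1-channel input with it, before the extra Conv2d maps 1 channel to 3.
imagenet_mean = [0.485, 0.456, 0.406]
imagenet_std = [0.229, 0.224, 0.225]
mean_1ch = [sum(imagenet_mean) / 3]  # ~0.449
std_1ch = [sum(imagenet_std) / 3]    # ~0.226

transform = transforms.Compose([
    transforms.ToTensor(),                    # 1-channel image -> [1, H, W] in [0, 1]
    transforms.Normalize(mean_1ch, std_1ch),
])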

Hi, I am working on multi-label image classification. My images are of size 256x256x2 and I want to use VGG-16; how can I reshape them?

You could use torchvision.transforms.Resize to resize the images to a spatial size of 224x224.
I’m not sure what the 2 channels represent in your images, but you might need to transform these images either to have 3 channels (since RGB input tensors are expected) or you could also use a custom conv layer, which would accept inputs with 2 channels and would output 3 channels.
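Something like this would be one way to combine both suggestions (a rough sketch assuming 2-channel inputs of shape [batch_size, 2, 256, 256] and a 1x1 conv chosen arbitrarily to map 2 to 3 channels):

import torch
import torch.nn as nn
from torchvision import models, transforms

# Resize to 224x224 and prepend a conv layer that maps the 2 input channels
# to the 3 channels VGG16 expects.
model = nn.Sequential(
    nn.Conv2d(2, 3, kernel_size=1),
    models.vgg16(pretrained=False),
)

x = torch.randn(4, 2, 256, 256)               # [batch_size, 2, H, W]
x = transforms.Resize((224, 224))(x)          # works on tensors in recent torchvision
out = model(x)
print(out.shape)                              # torch.Size([4, 1000])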

The 2 channels represent grey and white images in my problem.

VGG-16 can take an input of any size of at least 32 and the channel count should be 3. Am I right?

Yes, that should be the case, since adaptive pooling layers are used and thus the spatial size is not fixed to e.g. 224x224.
A quick test also works for an input tensor of [batch_size, 3, 32, 32] and fails for smaller spatial sizes.
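For reference, the quick test looks roughly like this (the exact error message may differ between versions):

import torch
from torchvision import models

model = models.vgg16(pretrained=False)

# Works: 5 pooling stages reduce 32x32 to 1x1, which the adaptive avg pool
# then maps to 7x7.
x = torch.randn(1, 3, 32, 32)
print(model(x).shape)  # torch.Size([1, 1000])

# Fails: the feature map would shrink to 0x0 at the last pooling stage.
try:
    model(torch.randn(1, 3, 16, 16))
except RuntimeError as e:
    print(e)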

Hi, I tried to extract features through a pre-trained AlexNet model in PyTorch.
I applied the previous reply from @Dr_John:

model = models.vgg16(pretrained=True)
# kernel_size was not defined in the original snippet; e.g. kernel_size=3 would work
first_conv_layer = [nn.Conv2d(1, 3, kernel_size, stride=1, padding=0, dilation=1, groups=1, bias=True)]
first_conv_layer.extend(list(model.features))
model.features = nn.Sequential(*first_conv_layer)

In my case, I want to extract features from a tabular dataset (size: 1x8), not an image dataset (size: 114x114 or 224x224).
So, I tried to add a conv layer in front of the AlexNet model.
Can I get some advice on applying a tabular dataset to the AlexNet model?

Hi, I saw your answer at Transfer learning usage with different input size - #22 by ptrblck and I’m glad you’re here. First of all, I use PyTorch and your answers in the forum have helped me a lot, so thank you.
I am currently doing regression on images using a VGG19 pretrained model from torchvision.
I am taking the original image as input and making the prediction, but I want to make the prediction with two images: the original image plus a transformed image!

cnn = models.vgg19(pretrained=True)
fc_inp_dim = cnn.classifier[6].in_features
cnn.classifier[6] = nn.Sequential(
    nn.Dropout(p=0.5, inplace=False),
    nn.Linear(fc_inp_dim, 1),
    nn.Sigmoid()
)

This is the current code. I don’t know how to add a second input to the front of the model. I know it’s a cumbersome question, but I would appreciate an answer.


I’m not sure I understand the question completely, but I assume you want to pass the original and transformed image to your model?
If so, something like this should work:

cnn = models.vgg19(pretrained=True)
fc_inp_dim = cnn.classifier[6].in_features
cnn.classifier[6] = nn.Sequential(
    nn.Dropout(p=0.5, inplace=False),
    nn.Linear(fc_inp_dim, 1),
    nn.Sigmoid()
)

x_original = torch.randint(0, 256, (1, 3, 224, 224)).float()
x_transformed = x_original / 255.

out_original = cnn(x_original)
out_transformed = cnn(x_transformed)

Note that I just normalized the input as an example of a transformation.
I’m still unsure if I’m missing the point of the question so let me know if this works.

My previous explanation was insufficient.
I already have an original image and a transformed image. I want to modify the model so that I can pass both of these images to it as inputs.
Is this possible?

If you want to call the forward with two input tensors, you could create a new custom model and implement the forward method as you wish (e.g. allowing multiple input tensors).
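A rough sketch of such a custom model (the names, and the way the two outputs are combined, here a simple average, are just placeholders):

import torch
import torch.nn as nn
from torchvision import models

class TwoInputVGG(nn.Module):
    def __init__(self):
        super().__init__()
        self.cnn = models.vgg19(pretrained=True)
        fc_inp_dim = self.cnn.classifier[6].in_features
        self.cnn.classifier[6] = nn.Sequential(
            nn.Dropout(p=0.5, inplace=False),
            nn.Linear(fc_inp_dim, 1),
            nn.Sigmoid()
        )

    def forward(self, x_original, x_transformed):
        out_original = self.cnn(x_original)
        out_transformed = self.cnn(x_transformed)
        # how the two predictions are combined is up to you;
        # a simple average is used here as a placeholder
        return (out_original + out_transformed) / 2

model = TwoInputVGG()
out = model(torch.randn(1, 3, 224, 224), torch.randn(1, 3, 224, 224))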

Oh, I see.
Thanks for the quick reply!
Your answer saved me a lot of time!
Have a nice day :)