Correct way to build and get encodings from siamese using pretrained model

I am trying to build a small siamese network (with an aim to get encodings from the last/pre-last layer) and would like to use a pretrained model + the extra layers needed to get the encodings.

I have something like this at the moment, but the results don't look great, so I now wonder if this is the correct way to build on top of a pretrained model.

import torch
import torch.nn as nn
import torch.nn.functional as F

class PretrainedSiamese(nn.Module):
    def __init__(self):
        super().__init__()
        self.pt_model = torch.hub.load('pytorch/vision', 'some_pretrained_model', pretrained=True)
        for param in self.pt_model.parameters():
            param.requires_grad = False
        self.pt_model.classifier[1] = torch.nn.Linear(in_features=self.model_ft.classifier[1].in_features, out_features=2)
        self.linear1 = nn.Linear(in_features=2, out_features=256, bias=True)
        self.linear2 = nn.Linear(in_features=256, out_features=128, bias=True) # encoder layer
        self.linear3 = nn.Linear(in_features=128, out_features=2, bias=True)

The forward I have is simply:

            out = self.pt_model(x)
            out = out.view(out.shape[0], -1)
            out = self.linear1(out)
            out = F.relu(out)
            out = self.linear2(out)
..< do siamese i.e loop twice>

So, I pass the outputs (logits) from the pretrained classifier and attach a small network to this to get the encoding from the pre-final layer. Does this sound sensible? I am unsure now that I have terrible results :slight_smile:

The code looks generally alright, assuming that self.model_ft is a typo and should be self.pt_model and that .classifier[1] is the last linear layer in the pretrained model.

If that’s the case, then the number of output features for the first linear layer seems to be a bit small with only 2 neurons. You would create quite a strong bottleneck at this point and might lose a lot of information from the signal.

@ptrblck: thank you so much for your input. Yes, it was a typo, corrected it now.

I completely understand what you are saying about the bottleneck, but I was wondering about the question below (and hence chose 2):
[1] the siamese network is almost a binary classifier with only two outputs (1/0 for same pair, different pair). The pretrained model is trained on 1000 classes, but I have only 2.

Does this rationale make any sense? In essence, the output from the pretrained model can have at most 1000 classes (assuming 1000 classes is what the pt model is trained on?). If not, what should this number be? Or is it another hyperparameter?

Thanks for the information.
I didn’t realize that the 2 would represent the number of classes.

Usually you would use the penultimate layer from the pretrained model as the “feature” output and add a classifier on top of it. Currently pt_model.classifier[1] as well as linear3 could be seen as a classification layer and the first bottleneck might be too aggressive.
However, as always it all depends on your use case and your model might in fact work just fine. :slight_smile:
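A minimal sketch of that layout, assuming a frozen backbone that already returns a flat feature vector. `TinyBackbone` is a hypothetical stand-in for the pretrained model (so the snippet runs without downloading weights), and the 64/256/128 sizes are purely illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TinyBackbone(nn.Module):
    """Hypothetical stand-in for a pretrained feature extractor."""

    def __init__(self, feat_dim=64):
        super().__init__()
        self.conv = nn.Conv2d(3, feat_dim, kernel_size=3, padding=1)

    def forward(self, x):
        x = F.relu(self.conv(x))
        # Global average pooling -> one feature vector per image.
        return F.adaptive_avg_pool2d(x, 1).flatten(1)


class SiameseEncoder(nn.Module):
    def __init__(self, backbone, feat_dim=64, emb_dim=128):
        super().__init__()
        self.backbone = backbone
        for p in self.backbone.parameters():
            p.requires_grad = False  # keep the backbone frozen
        # Small trainable head that maps features to the encoding.
        self.head = nn.Sequential(
            nn.Linear(feat_dim, 256),
            nn.ReLU(),
            nn.Linear(256, emb_dim),
        )

    def encode(self, x):
        return self.head(self.backbone(x))

    def forward(self, x1, x2):
        # Both branches share the same weights.
        return self.encode(x1), self.encode(x2)


model = SiameseEncoder(TinyBackbone())
a, b = torch.randn(4, 3, 32, 32), torch.randn(4, 3, 32, 32)
emb_a, emb_b = model(a, b)
distance = F.pairwise_distance(emb_a, emb_b)  # one distance per pair
```

Here the same/different decision would come from the distance between the two encodings (for example via a contrastive loss), so no 2-unit layer sits in front of the encoder.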

@ptrblck Thank you for the suggestion. In fact, I did want to do what you are suggesting but couldn't find a way to do it: how do I get the output of the penultimate layer and skip the classifier in the pretrained model? I can then attach my layers to this penultimate layer.

I would be really grateful if you could point me to any info on how to get the feature output from the penultimate layer.

@ptrblck: actually, it turned out to be quite simple. I just did:

self.pt_model = self.pt_model.features

it seems to work!

This approach might work or alternatively you could also replace the last layer with nn.Identity() and would thus get the feature tensor as the output. :slight_smile:

@ptrblck thank you again for this reply. Could you please point me to an example of how to use the Identity function to achieve this? Would like to try it. BTW: the last layer in say mobilenet is a classifier

<.. features..>
    (18): ConvBNReLU(
      (0): Conv2d(320, 1280, kernel_size=(1, 1), stride=(1, 1), bias=False)
      (1): BatchNorm2d(1280, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (2): ReLU6(inplace=True)
    )
  )
  (classifier): Sequential(
    (0): Dropout(p=0.2, inplace=False)
    (1): Linear(in_features=1280, out_features=1000, bias=True)
  )

I presume I drop the classifier, which is what I do with: self.pt_model = self.pt_model.features. Thus, the output is now from (2): ReLU6(inplace=True). Does this make sense? I would love to know how to do it via the Identity method, just to be sure that it gives the same output.

Many thanks for all your help. REALLY appreciate it.

Your assignment might work, but I would always double check, if you really get the desired outputs.
Since you are assigning the features block to pt_model, you assume that this block is executed sequentially in the original forward method, which might not always be the case.

To just reuse the original forward method, you could use:

pt_model.classifier = nn.Identity()

to replace the classifier module with a simple identity method.
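A small sketch of the replacement, using a hypothetical `ToyNet` that only mimics the features/classifier layout of the printout above (the layer sizes are made up so that it runs without pretrained weights):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ToyNet(nn.Module):
    """Hypothetical model: a `features` block, functional pooling,
    then a `classifier` block."""

    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),
            nn.ReLU(),
        )
        self.classifier = nn.Sequential(
            nn.Dropout(p=0.2),
            nn.Linear(16, 10),
        )

    def forward(self, x):
        x = self.features(x)
        x = F.adaptive_avg_pool2d(x, 1).reshape(x.shape[0], -1)
        return self.classifier(x)


pt_model = ToyNet().eval()
x = torch.randn(2, 3, 8, 8)
logits = pt_model(x)                 # shape (2, 10): classifier output

pt_model.classifier = nn.Identity()  # forward() is reused unchanged
feats = pt_model(x)                  # shape (2, 16): what the classifier used to receive
```

After the swap, the output width equals the `in_features` of the removed linear layer.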

@ptrblck: when I try the Identity method you have suggested, my outputs are way smaller (1280 for mobilenet, which is the size of the classifier input), but the output of the (2): ReLU6(inplace=True) when I do self.pt_model = self.pt_model.features seems to be 20480. Bizarre!

May I ask what you mean by a feature vector? To me this (2): ReLU6(inplace=True) is a feature vector. No?

Also, for mobilenet, I see:

  (features): Sequential(

so they are all sequential right?

In any case, I think the Identity method is better since it preserves the original instance, and I would prefer this if I can understand the size issue!

When I look at the mobilenet arch, towards the end, I seem to have:

    (18): ConvBNReLU(
      (0): Conv2d(320, 1280, kernel_size=(1, 1), stride=(1, 1), bias=False)
      (1): BatchNorm2d(1280, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (2): ReLU6(inplace=True)
    )
  )
  (classifier): Sequential(
    (0): Dropout(p=0.2, inplace=False)
    (1): Linear(in_features=1280, out_features=1000, bias=True)
  )

By turning the classifier into Identity, my output at the end should be the size of (2): ReLU6(inplace=True). Is that not right?

Thanks again @ptrblck.

That’s exactly the difference between reusing a submodule (features in this case) and replacing just the classifier while keeping the forward method.
As you can see in this line of code a functional pooling is applied, which will reduce the spatial size of the activation and reshape it.

I cannot say which approach is the right one, so you might want to run some experiments with and without the pooling.
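The two sizes reported above can be reproduced with a short sketch. The 4x4 spatial size is an assumption, chosen only because 1280 × 4 × 4 = 20480 matches those numbers; the real spatial size depends on the input resolution:

```python
import torch
import torch.nn.functional as F

# Assumed activation after the last ReLU6: 1280 channels on a 4x4 grid.
act = torch.randn(1, 1280, 4, 4)

# Reusing only `.features` and flattening: every spatial position is kept.
flat = act.reshape(act.shape[0], -1)

# Original forward(): average over the spatial grid, then reshape.
pooled = F.adaptive_avg_pool2d(act, 1).reshape(act.shape[0], -1)

print(flat.shape[1], pooled.shape[1])  # 20480 1280
```

So both tensors come from the same ReLU6 activation; the pooled one is just its per-channel spatial mean.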

@ptrblck: Thank you for this. Pardon my ignorance, but when you say:

I cannot say which approach is the right one, so you might want to run some experiments with and without the pooling.

do you mean I modify the file and run experiments with and without this line:

x = nn.functional.adaptive_avg_pool2d(x, 1).reshape(x.shape[0], -1)

Also, what am I looking for in the results? Do you mean I should measure accuracy in the two cases (with and without the line)? Sorry about the naive questions, I have not done this kind of experimenting before. I suspect what you are saying is that the approach (with or without pooling) is problem dependent?

Your new use case seems to use the penultimate activation tensors as an extracted feature.
While your approach would return the feature tensor before the pooling layer (which will thus be bigger), my proposed approach would apply the pooling and thus yield a smaller activation.

I don’t know how your model (or overall use case) will use this tensor, so I cannot say that my proposed approach is the “right one” for your use case.

Yes, I would suggest to use the validation loss (or accuracy) to compare both approaches.

@ptrblck: thank you VERY much for the detailed reply. I will run the experiments and report back. Thank you again, I would never have come this far without your input!

@ptrblck: I am still doing experiments on these, but today, I saw this snippet with detectron2:

these lines:

# load a pre-trained model for classification and return
# only the features
backbone = torchvision.models.mobilenet_v2(pretrained=True).features
# FasterRCNN needs to know the number of
# output channels in a backbone. For mobilenet_v2, it's 1280
# so we need to add it here
backbone.out_channels = 1280

means that instead of using Identity() as you suggested, they are using just the .features but, I am perplexed how they are assigning the out_channels here as 1280, since the features as you have shown on this line:

…uses avg pooling to get to 1280. So, what is the detectron tutorial doing here? :frowning: Using just the features cannot result in a vector of size 1280. Would be nice to know what this tutorial is doing

This attribute is not directly used in mobilenet_v2, but in FasterRCNN.
From the docs for FasterRCNN:

backbone (nn.Module): the network used to compute the features for the model.
It should contain a out_channels attribute, which indicates the number of output
channels that each feature map has (and it should be the same for all feature maps).
The backbone should return a single Tensor or and OrderedDict[Tensor].

So you are just creating this attribute, so that it can be used in the RPNHead.
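As a rough illustration, `out_channels` counts channels of the feature map, not the flattened feature size, which is why 1280 is consistent with using `.features` alone. The backbone below is a hypothetical stand-in, not the real mobilenet_v2 layers:

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for `mobilenet_v2(...).features`.
backbone = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1),
    nn.ReLU(),
    nn.Conv2d(32, 1280, kernel_size=1),
)
# Plain Python attribute: it only records the channel count for other code to read.
backbone.out_channels = 1280

fmap = backbone(torch.randn(1, 3, 64, 64))
# fmap is laid out as (batch, channels, height, width).
# out_channels matches dim 1, not channels * height * width.
print(fmap.shape[1] == backbone.out_channels)
```

No pooling is involved: the feature map keeps its spatial dimensions and only its channel dimension is 1280.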

@ptrblck: thanks again for the explanation - makes sense :slight_smile:

Hi @ptrblck,

I’m trying to build a DANN model from this paper. The architecture is given in the figure below.

For the feature extractor (green part) I’d like to use a pretrained model, e.g. resnet50. I’m just wondering if my code is correct.

class ReverseLayerF(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, alpha):
        ctx.alpha = alpha
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Reverse (and scale) the gradient; alpha gets no gradient.
        output = grad_output.neg() * ctx.alpha
        return output, None

class DANN(torch.nn.Module):
    def __init__(self,architecture,pretrained=True):
        super(DANN, self).__init__()
        assert architecture in ['shuffle05', 'mobile', 'resnet50']
        if architecture == 'shuffle05':
            model = models.shufflenet_v2_x0_5(pretrained=pretrained)
        elif architecture == 'mobile':
            model = models.mobilenet_v2(pretrained=pretrained)
        else:  # 'resnet50'
            model = models.resnet50(pretrained=pretrained)
        self.feature_extractor = torch.nn.Sequential(*list(model.children())[:-1])

        self.label_classifier = torch.nn.Sequential(

        self.domain_classifier = torch.nn.Sequential(
            torch.nn.Linear(2048, 128)

    def forward(self,x,alpha):
        feature = self.feature_extractor(x)
        feature = feature.view(feature.shape[0],-1)
        reverse_feature = ReverseLayerF.apply(feature, alpha)
        class_logit = self.label_classifier(feature)
        domain_logit = self.domain_classifier(reverse_feature)
        return class_logit,domain_logit

my_model = DANN('resnet50')

The code looks alright. Do you see any errors or issues during training?

You should be a bit careful with wrapping all child modules into an nn.Sequential container:

self.feature_extractor = torch.nn.Sequential(*list(model.children())[:-1])

While this would work for simple models, it could break for models which are using functional API calls inside their forward method.
If you want to remove the last layer, you could replace it with an nn.Identity() module instead.
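A toy illustration of the difference, assuming a hypothetical `FunctionalNet` whose forward method flattens the activation with a functional call:

```python
import torch
import torch.nn as nn


class FunctionalNet(nn.Module):
    """Hypothetical model with functional calls inside forward()."""

    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 8, kernel_size=3, padding=1)
        self.fc1 = nn.Linear(8 * 4 * 4, 16)
        self.fc2 = nn.Linear(16, 5)  # the "last layer" to be removed

    def forward(self, x):
        x = torch.relu(self.conv(x))
        x = torch.flatten(x, 1)  # not a child module, so children() misses it
        return self.fc2(self.fc1(x))


x = torch.randn(2, 3, 4, 4)
model = FunctionalNet()

# Option 1: rebuild from children. The relu and flatten steps are silently dropped.
seq = nn.Sequential(*list(model.children())[:-1])
try:
    seq(x)
    sequential_ok = True
except RuntimeError:
    sequential_ok = False  # fc1 receives a 4D tensor and the shapes mismatch

# Option 2: keep forward() and swap the last layer for nn.Identity().
model.fc2 = nn.Identity()
feats = model(x)  # shape (2, 16)
```

In this toy case the rebuilt container fails loudly; with other architectures it might run but skip an operation without any error, which is harder to notice.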
