Custom Ensemble approach

Please, could you check my model summary:

[VGG(
  (features): Sequential(
    (0): Conv2d(3, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (2): ReLU(inplace)
    (3): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (4): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (5): ReLU(inplace)
    (6): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (7): Conv2d(64, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (8): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (9): ReLU(inplace)
    (10): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (11): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (12): ReLU(inplace)
    (13): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (14): Conv2d(128, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (15): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (16): ReLU(inplace)
    (17): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (18): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (19): ReLU(inplace)
    (20): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (21): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (22): ReLU(inplace)
    (23): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (24): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (25): ReLU(inplace)
    (26): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (27): Conv2d(256, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (28): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (29): ReLU(inplace)
    (30): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (31): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (32): ReLU(inplace)
    (33): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (34): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (35): ReLU(inplace)
    (36): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (37): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (38): ReLU(inplace)
    (39): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (40): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (41): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (42): ReLU(inplace)
    (43): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (44): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (45): ReLU(inplace)
    (46): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (47): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (48): ReLU(inplace)
    (49): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (50): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (51): ReLU(inplace)
    (52): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
  )
  (avgpool): AdaptiveAvgPool2d(output_size=(7, 7))
  (classifier): Sequential(
    (0): Linear(in_features=25088, out_features=4096, bias=True)
    (1): ReLU(inplace)
    (2): Dropout(p=0.5)
    (3): Linear(in_features=4096, out_features=4096, bias=True)
    (4): ReLU(inplace)
    (5): Dropout(p=0.5)
    (6): Linear(in_features=4096, out_features=1000, bias=True)
  )
), GlobalPool(
  (avgpool): AdaptiveAvgPool2d(output_size=(1, 1))
  (maxpool): AdaptiveMaxPool2d(output_size=(1, 1))
  (exp_pool): ExpPool()
  (linear_pool): LinearPool()
  (lse_pool): LogSumExpPool()
), Conv2d(1024, 1, kernel_size=(1, 1), stride=(1, 1)),
Conv2d(1024, 1, kernel_size=(1, 1), stride=(1, 1)),
Conv2d(1024, 1, kernel_size=(1, 1), stride=(1, 1)),
Conv2d(1024, 1, kernel_size=(1, 1), stride=(1, 1)),
Conv2d(1024, 1, kernel_size=(1, 1), stride=(1, 1)),
Conv2d(1024, 1, kernel_size=(1, 1), stride=(1, 1)),
Conv2d(1024, 1, kernel_size=(1, 1), stride=(1, 1)),
Conv2d(1024, 1, kernel_size=(1, 1), stride=(1, 1)),
BatchNorm2d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True),
BatchNorm2d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True),
BatchNorm2d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True),
BatchNorm2d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True),
BatchNorm2d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True),
BatchNorm2d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True),
BatchNorm2d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True),
BatchNorm2d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True),
AttentionMap(
  (channel_attention): CAModule(
    (fc1): Linear(in_features=512, out_features=256, bias=True)
    (fc2): Linear(in_features=256, out_features=512, bias=True)
    (relu): ReLU()
    (sigmoid): Sigmoid()
  )
  (spatial_attention): SAModule(
    (conv1): Conv2d(512, 64, kernel_size=(1, 1), stride=(1, 1))
    (conv2): Conv2d(512, 64, kernel_size=(1, 1), stride=(1, 1))
    (conv3): Conv2d(512, 512, kernel_size=(1, 1), stride=(1, 1))
  )
  (pyramid_attention): FPAModule(
    (gap_branch): Sequential(
      (0): AdaptiveAvgPool2d(output_size=1)
      (1): Conv2dNormRelu(
        (conv): Sequential(
          (0): Conv2d(512, 512, kernel_size=(1, 1), stride=(1, 1))
          (1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          (2): ReLU(inplace)
        )
      )
    )
    (mid_branch): Conv2dNormRelu(
      (conv): Sequential(
        (0): Conv2d(512, 512, kernel_size=(1, 1), stride=(1, 1))
        (1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (2): ReLU(inplace)
      )
    )
    (downsample1): Conv2dNormRelu(
      (conv): Sequential(
        (0): Conv2d(512, 1, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3))
        (1): BatchNorm2d(1, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (2): ReLU(inplace)
      )
    )
    (downsample2): Conv2dNormRelu(
      (conv): Sequential(
        (0): Conv2d(1, 1, kernel_size=(5, 5), stride=(2, 2), padding=(2, 2))
        (1): BatchNorm2d(1, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (2): ReLU(inplace)
      )
    )
    (downsample3): Conv2dNormRelu(
      (conv): Sequential(
        (0): Conv2d(1, 1, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
        (1): BatchNorm2d(1, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (2): ReLU(inplace)
      )
    )
    (scale1): Conv2dNormRelu(
      (conv): Sequential(
        (0): Conv2d(1, 1, kernel_size=(7, 7), stride=(1, 1), padding=(3, 3))
        (1): BatchNorm2d(1, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (2): ReLU(inplace)
      )
    )
    (scale2): Conv2dNormRelu(
      (conv): Sequential(
        (0): Conv2d(1, 1, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2))
        (1): BatchNorm2d(1, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (2): ReLU(inplace)
      )
    )
    (scale3): Conv2dNormRelu(
      (conv): Sequential(
        (0): Conv2d(1, 1, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
        (1): BatchNorm2d(1, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (2): ReLU(inplace)
      )
    )
  )
)]

What is the shape of x before feeding it to self.classifier?

Based on the model summary, it seems you've changed the model, since e.g. the classifier is now an nn.Sequential container, while it was a single linear layer before.

torch.cat yields a tensor of size [24, 1], where each of x1, x2, and x3 has size [8, 1].

Exactly, my model has some modifications, so I think I missed something.
Thank you so much for your help!

This shouldn't be the case if you have kept torch.cat((x1, x2, x3), dim=1):

x1 = torch.randn(8, 1)
x2 = torch.randn(8, 1)
x3 = torch.randn(8, 1)
x = torch.cat((x1, x2, x3), dim=1)
print(x.shape)
> torch.Size([8, 3])

Since you modified the code, please make sure to post the new (executable) code so that we can take another look.

Here is the code:

class MyEnsemble(nn.Module):
    def __init__(self, model_1, model_2, model_3, nb_classes=8):
        super(MyEnsemble, self).__init__()
        self.model_1 = model_1
        self.model_2 = model_2
        self.model_3 = model_3
        # Remove last linear layer
        self.model_1.classifier = nn.Identity()
        self.model_2.classifier = nn.Identity()
        self.model_3.classifier = nn.Identity()
        self.classifier = nn.Linear(24, 8)
        
    def forward(self, x):
        x1 = self.model_1(x.clone())  # clone to make sure x is not changed by inplace methods
        x1 = torch.stack(x1[0])
        x2 = self.model_2(x)
        x2 = torch.stack(x2[0])
        x3 = self.model_3(x)
        x3 = torch.stack(x3[0])
        x = torch.stack((x1, x2, x3), dim=1)
        x = F.relu(x.view(x.size(0), -1))
        x = self.classifier(x)
        return x

You've changed the torch.cat call to torch.stack, which will output x in the shape [8, 3, 1] if x1, x2, and x3 have the shape [8, 1].
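
For reference, a minimal comparison of the two calls, using the shapes from the discussion above:

import torch

x1 = torch.randn(8, 1)
x2 = torch.randn(8, 1)
x3 = torch.randn(8, 1)
print(torch.cat((x1, x2, x3), dim=1).shape)    # torch.Size([8, 3])
print(torch.stack((x1, x2, x3), dim=1).shape)  # torch.Size([8, 3, 1])

torch.cat joins tensors along an existing dimension, while torch.stack inserts a new one.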

@ptrblck Hi ptrblck!

I am a new learner in PyTorch. I have two trained models and want to use both of them to predict. I load the models with:

modelA = torch.load('~/CNN/pytorch/ensem_model_1.py')
modelA.eval()

modelB = torch.load('~/CNN/pytorch/ensem_model_2.py')
modelB.eval()

With your ensemble codes:

## Predict with ensemble models
class MyEnsemble(nn.Module):
    def __init__(self, modelA, modelB, nb_classes=251):
        super(MyEnsemble, self).__init__()
        self.modelA = modelA
        self.modelB = modelB
        # Remove last linear layer
        self.modelA.fc = nn.Identity()
        self.modelB.fc = nn.Identity()
        
        # Create new classifier
        self.classifier = nn.Linear(2048+2048, nb_classes)
        
    def forward(self, x):
        x1 = self.modelA(x.clone())  # clone to make sure x is not changed by inplace methods
        x1 = x1.view(x1.size(0), -1)
        x2 = self.modelB(x)
        x2 = x2.view(x2.size(0), -1)
        x = torch.cat((x1, x2), dim=1)
        
        x = self.classifier(F.relu(x))
        return x

The prediction is performed by:

model = MyEnsemble(modelA, modelB)
model = model.to(device)
print(check_accuracy_part34(loader_val, model))

where,

def check_accuracy_part34(loader, model):
    num_correct = 0
    num_samples = 0
    model.eval()  # set model to evaluation mode
    with torch.no_grad():
        for x, y in loader:
            x = x.to(device=device, dtype=dtype)  # move to device, e.g. GPU
            y = y.to(device=device, dtype=torch.long)
            scores = model(x)
            _, preds = scores.max(1)
            num_correct += (preds == y).sum()
            num_samples += preds.size(0)
        acc = float(num_correct) / num_samples
        print('  Got %d / %d correct (%.2f)' % (num_correct, num_samples, 100 * acc))
        return acc, preds

My problem is that the validation accuracy I get from the ensembled model is very, very low, about 0.3%. But if I test the performance of my single model only, the validation accuracy is about 65%. It feels like the ensembled model does not inherit the trained parameters. I am not sure if my intuition is correct or not. Any help with this?

Thanks in advance!

Note that you are creating a new linear layer in MyEnsemble, which will be randomly initialized.
This classifier will take the penultimate activations from both base models and output the new predictions.
If you haven't retrained this layer, the performance is expected to be bad.
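
A minimal sketch of such a retraining step (assuming the MyEnsemble module, device, and data loading from the surrounding posts; loader_train and the optimizer settings are placeholders):

import torch
import torch.nn as nn

model = MyEnsemble(modelA, modelB)
model = model.to(device)

# Freeze the pretrained submodels so only the new classifier is updated
for param in model.modelA.parameters():
    param.requires_grad = False
for param in model.modelB.parameters():
    param.requires_grad = False

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.classifier.parameters(), lr=1e-3)

model.train()
for x, y in loader_train:  # hypothetical training DataLoader
    x = x.to(device)
    y = y.to(device)
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()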

Could you please give an example of how to ensemble two models in Faster R-CNN?
A = model1.pth
B = model2.pth
I need C = model = (A + B) for a single input.
Code example:

RoIHeads(
  (box_roi_pool): MultiScaleRoIAlign()
  (box_head): TwoMLPHead(
    (fc6): Linear(in_features=12544, out_features=1024, bias=True)
    (fc7): Linear(in_features=1024, out_features=1024, bias=True)
  )
  (box_predictor): FastRCNNPredictor(
    (cls_score): Linear(in_features=1024, out_features=50, bias=True)
    (bbox_pred): Linear(in_features=1024, out_features=412, bias=True)
  )
)

It depends on which features or outputs you would like to concatenate and how this ensemble should look.
Could you add some more information, so that I can see how it might be implemented?

Thanks for the reply.
I have a modelA which is trained on class 'a' and a modelB which is trained on class 'b'. Both see similar types of data (X), so I need a modelC to predict classes 'a' and 'b' for an input X.
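
One possible starting point, following the same pattern as the earlier snippets in this thread (a sketch only; it assumes both models keep their features behind a .fc attribute with 2048 outputs, as in a ResNet, so adjust to your actual architectures):

import torch
import torch.nn as nn
import torch.nn.functional as F

class ModelC(nn.Module):
    def __init__(self, modelA, modelB, nb_classes=2):
        super(ModelC, self).__init__()
        self.modelA = modelA
        self.modelB = modelB
        # Drop the original heads, keep the feature extractors
        self.modelA.fc = nn.Identity()
        self.modelB.fc = nn.Identity()
        # New classifier deciding between class 'a' and class 'b'
        self.classifier = nn.Linear(2048 + 2048, nb_classes)

    def forward(self, x):
        featsA = self.modelA(x.clone())
        featsB = self.modelB(x)
        feats = torch.cat((featsA, featsB), dim=1)
        return self.classifier(F.relu(feats))

As with the other ensembles above, the new classifier would need to be trained on data containing both classes.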

Thanks again. Recently I have been learning about ensembles; ensembling includes Bagging, Boosting, and Stacking.
Could you tell me which kind of method your approach belongs to?

My code example would probably come close to the Stacking method, since another classifier is trained on top of the feature outputs from the pretrained models.

A quick recap of the mentioned techniques (since you are currently studying them, please correct me, as I haven't looked into them recently):

  • Bagging - would involve bootstrapping during the sampling phase and should thus be independent of the model architecture (it would use weak learners, if I'm not mistaken)
  • Boosting - sequential weak learners, which are trained on the "residuals" of the preceding classifiers
  • Stacking - staged classifiers, which would use the output of the previous stage to create a new output

Thanks a lot. In fact, I want to use PyTorch to implement the three methods, especially Bagging. Many papers say that using an ensemble can improve the results. I searched Google for related code on GitHub and Kaggle but couldn't find anything similar (most of it uses Keras), until I used your code, which gave a better result. However, in the "ensemble" topic most of your code uses concatenation, for example x = torch.cat((x1, x2), dim=1),
but some papers say they use averaging or voting methods to implement the ensemble, and I cannot find related code on the PyTorch forums. Is there a full-blown example of how to ensemble two models, for example vgg19 and resnet18, for prediction on the same dataset using the voting or averaging method? Or could you recommend some code for me to learn from? I would really appreciate it.

I don't know if there are examples, but to calculate the average of multiple model outputs, you could use:

outputA = modelA(data)
outputB = modelB(data)
outputs = (F.softmax(outputA, 1) + F.softmax(outputB, 1)) / 2.

(if you have multiple models, you could of course use a loop, if that's easier)
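
For (hard) majority voting instead of averaging, a rough sketch could look like this (modelC is a hypothetical third model; with an even number of voters, ties would need a tie-breaking rule):

import torch

predsA = modelA(data).argmax(dim=1)
predsB = modelB(data).argmax(dim=1)
predsC = modelC(data).argmax(dim=1)
votes = torch.stack((predsA, predsB, predsC), dim=1)  # shape [batch_size, 3]
final_preds = votes.mode(dim=1).values  # most frequent class per sample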

To implement the bootstrap sampling, you could use e.g. sklearn.cross_validation.Bootstrap.
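
(Note: sklearn.cross_validation has since been removed from scikit-learn. A similar effect can be achieved with sklearn.utils.resample; the sketch below assumes dataset is an indexable PyTorch Dataset:)

import numpy as np
import torch
from sklearn.utils import resample

indices = np.arange(len(dataset))
boot_indices = resample(indices, replace=True, n_samples=len(dataset))
boot_dataset = torch.utils.data.Subset(dataset, boot_indices.tolist())

Alternatively, torch.utils.data.RandomSampler(dataset, replacement=True) draws samples with replacement directly in PyTorch.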

Hi dear Rosa,
How can I update my PyTorch? Could you please tell me the command for it?
Thank you

@ptrblck, thank you. I have a similar requirement, and I have one more doubt: what should I do with the loss? Should I add the losses from the 2 models, and if yes, where? I am very new to PyTorch and would appreciate your input. Both of my models are image classifiers, and I am experimenting with more than one pretrained model, like inception_v3, resnet, etc. In my case both models will be the same, say inception_v3, and both will have an equal number of classes in the output (10). The final output will be binary, so the output will identify whether the image is a vowel or a consonant. The input data has images of vowels and consonants.

In my code snippets the output of the submodels is fed to a new classifier, and the original classifiers in the submodels are removed.
This would yield a single output, and thus a single loss would be calculated.

@ptrblck, thanks for the response, very helpful. In my case I realized that each classifier will give a single output (for example, the first one will predict the class of the vowel and the second the class of the consonant). As per my understanding, the code snippet you gave will work for this case as well. I am going to try it now with my data and will update. I wanted to add this and get your advice.

Your approach should work fine, and you could create two final classifiers in the main model and feed the two outputs from the submodules to them, e.g. as in the sketch below.
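
A rough sketch of such a two-head setup (a sketch only; feat_dim=2048 and the .fc attribute are assumptions that hold for e.g. resnet50 or inception_v3 backbones):

import torch
import torch.nn as nn

class TwoHeadEnsemble(nn.Module):
    def __init__(self, model_vowel, model_consonant, feat_dim=2048, n_classes=10):
        super(TwoHeadEnsemble, self).__init__()
        self.model_vowel = model_vowel
        self.model_consonant = model_consonant
        # Replace the original classifiers with identity, keeping the feature extractors
        self.model_vowel.fc = nn.Identity()
        self.model_consonant.fc = nn.Identity()
        # Two new heads, one per task
        self.head_vowel = nn.Linear(feat_dim, n_classes)
        self.head_consonant = nn.Linear(feat_dim, n_classes)

    def forward(self, x):
        out_vowel = self.head_vowel(self.model_vowel(x.clone()))
        out_consonant = self.head_consonant(self.model_consonant(x))
        return out_vowel, out_consonant

During training, each output would get its own criterion, and the two losses can be summed before calling backward().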
