Why do two similar models reach different accuracy?

model1:

def forward(self, x, labels=None, return_cam=False):
        batch_size = x.shape[0]
        x = self.features(x)
        
        x1 = self.conv6(x)
        x1 = self.relu(x1)
        x1 = self.conv7(x1)
        x1 = self.relu(x1)
        
        x2 = self.conv8(x)
        x2 = self.relu(x2)
        x2 = self.conv9(x2)
        x2 = self.relu(x2)

        x3 = self.conv10(x)
        x3 = self.relu(x3)
        x3 = self.conv11(x3)
        x3 = self.relu(x3)

        x = x1 + x2 + x3

model2:

def forward(self, x, labels=None, return_cam=False):
        batch_size = x.shape[0]
        x1 = self.features(x)
        x1 = self.conv6(x1)
        x1 = self.relu(x1)
        x1 = self.conv7(x1)
        x1 = self.relu(x1)
        
        x2 = self.features(x)
        x2 = self.conv8(x2)
        x2 = self.relu(x2)
        x2 = self.conv9(x2)
        x2 = self.relu(x2)

        x3 = self.features(x)
        x3 = self.conv10(x3)
        x3 = self.relu(x3)
        x3 = self.conv11(x3)
        x3 = self.relu(x3)

        x = x1 + x2 + x3

self.features contains pretrained VGG layers.
I get slightly better accuracy with model2. Why is that?

If self.features contains e.g. batchnorm layers, you would update the running statistics three times in model2, which might improve the training accuracy.
Let me know if that’s not the case and we can dig a bit deeper.
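To illustrate the batchnorm point, here is a minimal sketch. It assumes a toy backbone (a conv followed by nn.BatchNorm2d) as a hypothetical stand-in for self.features, which is not the architecture from this thread. It only shows that calling the same module three times per iteration in training mode updates the running statistics three times.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)


def make_features():
    # Hypothetical stand-in for self.features (not the VGG backbone from the thread).
    return nn.Sequential(
        nn.Conv2d(3, 8, kernel_size=3, padding=1),
        nn.BatchNorm2d(8),
    )


features_once = make_features()
features_thrice = make_features()
# Start both copies from identical weights and buffers.
features_thrice.load_state_dict(features_once.state_dict())

features_once.train()
features_thrice.train()

x = torch.randn(4, 3, 16, 16)

features_once(x)        # model1 style: one backbone call per iteration
for _ in range(3):      # model2 style: three backbone calls per iteration
    features_thrice(x)

bn_once = features_once[1]
bn_thrice = features_thrice[1]

# num_batches_tracked counts how often the running statistics were updated.
print(bn_once.num_batches_tracked.item(), bn_thrice.num_batches_tracked.item())
# The running means differ because the momentum update was applied three times.
print(torch.allclose(bn_once.running_mean, bn_thrice.running_mean))
```

If the backbone has no batchnorm layers (or is kept in eval mode), this effect does not apply.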

There is no batch normalization. However, after blocks 3, 4, and 5, I take the sigmoid of the channel-wise mean of the feature maps and then multiply the feature maps by the result:

attention = torch.mean(input_, dim=1, keepdim=True)
importance_map = torch.sigmoid(attention)
return input_.mul(importance_map)

Could that be the reason?
Another question: why is self.features being updated three times in model2 but only once in model1? Is the flow of gradients not the same for both models?

The gradient flow should be the same; I was only referring to the updates of the internal batchnorm statistics.

The additional operation using the importance_map might of course change the performance of the model.
If I understand the use case correctly, you are using this attention op in the second approach, but not in the first one?

No, I am using the attention op in both models. self.features is identical for both approaches.
Can you explain the differences between the two models to me?

Could you post the complete architecture, please?

Thanks for the code.
I’ve added myModel2 as:

class myModel2(nn.Module):
    def __init__(self, features, num_classes=200, **kwargs):
        super(myModel2, self).__init__()
        self.features = features
        self.conv6 = nn.Conv2d(512,  1024, kernel_size=3, padding=1) 
        self.conv7 = nn.Conv2d(1024, num_classes, kernel_size=1)
        self.conv8 = nn.Conv2d(512,  1024, kernel_size=3, padding=1) 
        self.conv9 = nn.Conv2d(1024, num_classes, kernel_size=1)
        self.conv10 = nn.Conv2d(512,  1024, kernel_size=3, padding=1) 
        self.conv11 = nn.Conv2d(1024, num_classes, kernel_size=1)
        self.conv12 = nn.Conv2d(512,  1024, kernel_size=3, padding=1) 
        self.conv13 = nn.Conv2d(1024, num_classes, kernel_size=1)
        self.relu = nn.ReLU(inplace=False)
        self.avgpool = nn.AdaptiveAvgPool2d(1)
        #self.fc = nn.Linear(1024, num_classes)
        initialize_weights(self.modules(), init_mode='he')
    def forward(self, x, labels=None, return_cam=False):
        batch_size = x.shape[0]
        x = self.features(x)
        
        x1 = self.conv6(x)
        x1 = self.relu(x1)
        x1 = self.conv7(x1)
        x1 = self.relu(x1)
        #x2 = self.features(x)
        x2 = self.conv8(x)
        x2 = self.relu(x2)
        x2 = self.conv9(x2)
        x2 = self.relu(x2)
        #x3 = self.features(x)
        x3 = self.conv10(x)
        x3 = self.relu(x3)
        x3 = self.conv11(x3)
        x3 = self.relu(x3)
        #x4 = self.features(x)
        x4 = self.conv12(x)
        x4 = self.relu(x4)
        x4 = self.conv13(x4)
        x4 = self.relu(x4)
        x = x1 + x2 + x3 + x4
        if return_cam:
            normalized_feature_map = normalize_tensor(x.detach().clone())
            cams = normalized_feature_map[range(batch_size), labels]
            return cams

        x = self.avgpool(x)
        x = x.view(x.size(0), -1)
        return {'logits': x}

and get exactly the same results using this code:

features = nn.Conv2d(1, 512, 1)
model1 = myModel7(features)
model2 = myModel2(features)
model2.load_state_dict(model1.state_dict())

x = torch.randn(1, 1, 4, 4)
out1 = model1(x)
out2 = model2(x)

print((out1['logits'] - out2['logits']).abs().max())
> tensor(0., grad_fn=<MaxBackward1>)

I am sure the output is the same.

But after training both models on the CUB dataset, myModel6 reaches 2.1% higher classification accuracy than the other model. I just want to know why. Are the weights updated the same way in both models? Are there any differences?

def forward(self, x, labels=None, return_cam=False):
        batch_size = x.shape[0]
        x1 = self.features(x)
        x1 = self.conv6(x1)
        x1 = self.relu(x1)
        x1 = self.conv7(x1)
        x1 = self.relu(x1)
        
        x2 = self.features(x)
        x2 = self.conv8(x2)
        x2 = self.relu(x2)
        x2 = self.conv9(x2)
        x2 = self.relu(x2)

        x3 = self.features(x)
        x3 = self.conv10(x3)
        x3 = self.relu(x3)
        x3 = self.conv11(x3)
        x3 = self.relu(x3)

        x = x1 + x2 + x3

(I am sorry if I am bothering you.)

The updates should be equal, as I also get exactly the same gradients for all parameters in these models.
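A gradient comparison of this kind can be sketched as follows. This is not the exact check that was run; it assumes a tiny 1x1 conv as a hypothetical stand-in for self.features and three small heads in place of conv6 to conv11. It compares the backbone gradients when the backbone is called once and reused against calling it once per branch.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-ins: a 1x1 conv for self.features and three small heads.
features = nn.Conv2d(1, 4, kernel_size=1)
heads = nn.ModuleList([nn.Conv2d(4, 2, kernel_size=1) for _ in range(3)])
x = torch.randn(2, 1, 4, 4)


def feature_grads(recompute):
    """Return the backbone gradients for one forward/backward pass."""
    features.zero_grad()
    heads.zero_grad()
    if recompute:
        # model2 style: call the backbone once per branch
        branches = [torch.relu(head(features(x))) for head in heads]
    else:
        # model1 style: call the backbone once and reuse its output
        shared = features(x)
        branches = [torch.relu(head(shared)) for head in heads]
    out = branches[0] + branches[1] + branches[2]
    out.mean().backward()
    return [p.grad.clone() for p in features.parameters()]


grads_shared = feature_grads(recompute=False)
grads_recomputed = feature_grads(recompute=True)

# Compare weight and bias gradients up to floating-point tolerance.
for g_shared, g_recomputed in zip(grads_shared, grads_recomputed):
    print(torch.allclose(g_shared, g_recomputed, atol=1e-6))
```

With a deterministic backbone the two variants accumulate the same gradient, since autograd sums the contributions of all three branches in both cases. Stochastic layers such as dropout in the backbone would break this equivalence, because each call would then draw a different mask.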
How reproducible is the accuracy difference? That is, do you always see this gap across, e.g., 10 runs with different seeds?

I will have to check that!

The gap might of course just be due to “bad luck”, but if you are seeing a consistent difference, something else is going on that I’m missing.
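One way to judge whether the gap is consistent is to compare per-seed accuracies. The helper below is a minimal sketch; summarize_gap is a hypothetical name, and it assumes you have already trained both models with the same list of seeds and collected one accuracy per run.

```python
import statistics


def summarize_gap(acc_a, acc_b):
    """Return (mean, sample std) of the per-seed accuracy differences b - a.

    acc_a and acc_b are accuracies of the two models, paired by seed.
    """
    if len(acc_a) != len(acc_b):
        raise ValueError("both models need one accuracy per seed")
    if len(acc_a) < 2:
        raise ValueError("at least two seeds are needed for a standard deviation")
    diffs = [b - a for a, b in zip(acc_a, acc_b)]
    return statistics.mean(diffs), statistics.stdev(diffs)
```

If the mean difference is small compared to its standard deviation, the 2.1% gap from a single run is plausibly just seed noise.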