# Why do two similar models reach different accuracy?

model1:

```python
def forward(self, x, labels=None, return_cam=False):
    batch_size = x.shape[0]
    x = self.features(x)  # backbone is evaluated once and shared by all branches

    x1 = self.conv6(x)
    x1 = self.relu(x1)
    x1 = self.conv7(x1)
    x1 = self.relu(x1)

    x2 = self.conv8(x)
    x2 = self.relu(x2)
    x2 = self.conv9(x2)
    x2 = self.relu(x2)

    x3 = self.conv10(x)
    x3 = self.relu(x3)
    x3 = self.conv11(x3)
    x3 = self.relu(x3)

    x = x1 + x2 + x3
```

model2:

```python
def forward(self, x, labels=None, return_cam=False):
    batch_size = x.shape[0]
    x1 = self.features(x)  # backbone is evaluated separately for each branch
    x1 = self.conv6(x1)
    x1 = self.relu(x1)
    x1 = self.conv7(x1)
    x1 = self.relu(x1)

    x2 = self.features(x)
    x2 = self.conv8(x2)
    x2 = self.relu(x2)
    x2 = self.conv9(x2)
    x2 = self.relu(x2)

    x3 = self.features(x)
    x3 = self.conv10(x3)
    x3 = self.relu(x3)
    x3 = self.conv11(x3)
    x3 = self.relu(x3)

    x = x1 + x2 + x3
```

`self.features` contains pretrained VGG layers.
I get slightly better accuracy with model2. Why is that?

If `self.features` contains e.g. batchnorm layers, you would update the running statistics three times in `model2`, which might improve the training accuracy.
Let me know if that’s not the case and we can dig a bit deeper.
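To illustrate the batchnorm point, here is a minimal sketch that uses a single `nn.BatchNorm2d` layer as a hypothetical stand-in for a feature extractor containing batchnorm. It only shows that every forward pass in training mode updates the running statistics, so three passes over the same input leave the layer in a different state than one pass.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-ins for a backbone that contains batchnorm.
bn_once = nn.BatchNorm2d(3)
bn_thrice = nn.BatchNorm2d(3)
bn_once.train()
bn_thrice.train()

# Shift the input mean away from 0 so the change in running_mean is visible.
x = torch.randn(8, 3, 4, 4) + 2.0

bn_once(x)            # one forward pass, as in model1
for _ in range(3):    # three forward passes on the same input, as in model2
    bn_thrice(x)

print(bn_once.num_batches_tracked.item(), bn_thrice.num_batches_tracked.item())
print(bn_once.running_mean)
print(bn_thrice.running_mean)
```

With the default momentum, the running mean moves further towards the batch mean after three updates than after one.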

There is no batch normalization. However, after blocks 3, 4 and 5, I take the sigmoid of the channel-wise mean of the feature maps and multiply the feature maps by the result.

```python
attention = torch.mean(input_, dim=1, keepdim=True)
importance_map = torch.sigmoid(attention)
return input_.mul(importance_map)
```

Could that be the reason?
Another question: why is `self.features` updated three times in `model2` but only once in `model1`? Isn't the gradient flow the same for both models?

The gradient flow should be the same and I was only referring to the updates of the internal batchnorm statistics.

The additional operation using the `importance_map` might of course change the performance of the model.
If I understand the use case correctly, you are using this attention op in the second approach, but not in the first one?
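One way to check whether the attention op itself could behave differently between the two models is to confirm that it holds no state. The sketch below wraps the three lines posted above in a module (the class name `ChannelMeanAttention` is made up for illustration) and checks that it has no parameters or buffers and returns identical outputs on repeated calls.

```python
import torch
import torch.nn as nn


class ChannelMeanAttention(nn.Module):  # hypothetical name, for illustration only
    def forward(self, input_):
        # Average over the channel dimension, squash to (0, 1), rescale the input.
        attention = torch.mean(input_, dim=1, keepdim=True)
        importance_map = torch.sigmoid(attention)
        return input_.mul(importance_map)


torch.manual_seed(0)
op = ChannelMeanAttention()
x = torch.randn(2, 512, 7, 7)

out_a = op(x)
out_b = op(x)

n_params = sum(p.numel() for p in op.parameters())
n_buffers = sum(b.numel() for b in op.buffers())
print(n_params, n_buffers)
print(torch.equal(out_a, out_b))
```

If the same holds in the real model, calling `self.features` several times should not change what this op computes.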

No, I am using the attention op in both models. `self.features` is identical for both approaches.
Can you explain the differences between the two models?

Could you post the complete architecture, please?

Thanks for the code.
I’ve added `myModel2` as:

```python
import torch.nn as nn

# initialize_weights and normalize_tensor are helpers from the poster's code base (not shown here).


class myModel2(nn.Module):
    def __init__(self, features, num_classes=200, **kwargs):
        super(myModel2, self).__init__()
        self.features = features
        self.conv6 = nn.Conv2d(512, 1024, kernel_size=3, padding=1)
        self.conv7 = nn.Conv2d(1024, num_classes, kernel_size=1)
        self.conv8 = nn.Conv2d(512, 1024, kernel_size=3, padding=1)
        self.conv9 = nn.Conv2d(1024, num_classes, kernel_size=1)
        self.conv10 = nn.Conv2d(512, 1024, kernel_size=3, padding=1)
        self.conv11 = nn.Conv2d(1024, num_classes, kernel_size=1)
        self.conv12 = nn.Conv2d(512, 1024, kernel_size=3, padding=1)
        self.conv13 = nn.Conv2d(1024, num_classes, kernel_size=1)
        self.relu = nn.ReLU(inplace=False)
        # Assumption: forward() uses self.avgpool, but the posted code never defines it.
        # Global average pooling is a guess at what was intended.
        self.avgpool = nn.AdaptiveAvgPool2d(1)
        # self.fc = nn.Linear(1024, num_classes)
        initialize_weights(self.modules(), init_mode='he')

    def forward(self, x, labels=None, return_cam=False):
        batch_size = x.shape[0]
        x = self.features(x)  # backbone evaluated once, shared by all branches

        x1 = self.conv6(x)
        x1 = self.relu(x1)
        x1 = self.conv7(x1)
        x1 = self.relu(x1)
        # x2 = self.features(x)
        x2 = self.conv8(x)
        x2 = self.relu(x2)
        x2 = self.conv9(x2)
        x2 = self.relu(x2)
        # x3 = self.features(x)
        x3 = self.conv10(x)
        x3 = self.relu(x3)
        x3 = self.conv11(x3)
        x3 = self.relu(x3)
        # x4 = self.features(x)
        x4 = self.conv12(x)
        x4 = self.relu(x4)
        x4 = self.conv13(x4)
        x4 = self.relu(x4)
        x = x1 + x2 + x3 + x4
        if return_cam:
            normalized_feature_map = normalize_tensor(x.detach().clone())
            cams = normalized_feature_map[range(batch_size), labels]
            return cams

        x = self.avgpool(x)
        x = x.view(x.size(0), -1)
        return {'logits': x}
```

and get exactly the same results using this code:

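```python
import torch
import torch.nn as nn

# myModel7 is the poster's other model (its definition is not shown here).
features = nn.Conv2d(1, 512, 1)
model1 = myModel7(features)
model2 = myModel2(features)

x = torch.randn(1, 1, 4, 4)
out1 = model1(x)
out2 = model2(x)

# Largest absolute difference between the two models' logits
print((out1['logits'] - out2['logits']).abs().max())
```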

I am sure the output is the same.

But after training both models on the CUB dataset, myModel6 gets a 2.1% increase in classification accuracy compared to the other model. I just want to know why that is. Are the weights being updated in the same way in both models? Are there any differences?

```python
def forward(self, x, labels=None, return_cam=False):
    batch_size = x.shape[0]
    x1 = self.features(x)  # backbone is evaluated separately for each branch
    x1 = self.conv6(x1)
    x1 = self.relu(x1)
    x1 = self.conv7(x1)
    x1 = self.relu(x1)

    x2 = self.features(x)
    x2 = self.conv8(x2)
    x2 = self.relu(x2)
    x2 = self.conv9(x2)
    x2 = self.relu(x2)

    x3 = self.features(x)
    x3 = self.conv10(x3)
    x3 = self.relu(x3)
    x3 = self.conv11(x3)
    x3 = self.relu(x3)

    x = x1 + x2 + x3
```

(I am sorry if I am bothering you.)

The updates should be equal, as I also get exactly the same gradients for all parameters in these models.
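A gradient comparison of this kind can be sketched as follows. This is not the code from the thread: `TinyBranches` is a hypothetical, much smaller model that keeps only the relevant structure (a shared backbone followed by three summed branches), and double precision is used so that floating-point accumulation order does not blur the comparison.

```python
import copy
import torch
import torch.nn as nn


class TinyBranches(nn.Module):  # hypothetical toy model, not the poster's architecture
    def __init__(self, call_features_once):
        super().__init__()
        self.call_features_once = call_features_once
        self.features = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU())
        self.heads = nn.ModuleList([nn.Conv2d(8, 4, 1) for _ in range(3)])

    def forward(self, x):
        if self.call_features_once:
            shared = self.features(x)
            feats = [shared, shared, shared]
        else:
            feats = [self.features(x) for _ in range(3)]
        out = sum(torch.relu(head(f)) for head, f in zip(self.heads, feats))
        return out.mean(dim=(2, 3))  # global average pooling -> logits


torch.manual_seed(0)
model_once = TinyBranches(call_features_once=True).double()
model_thrice = copy.deepcopy(model_once)  # identical weights
model_thrice.call_features_once = False

x = torch.randn(4, 3, 8, 8, dtype=torch.double)
target = torch.randint(0, 4, (4,))
criterion = nn.CrossEntropyLoss()

for model in (model_once, model_thrice):
    criterion(model(x), target).backward()

# Largest absolute gradient difference over all parameters
max_diff = max(
    (p1.grad - p2.grad).abs().max().item()
    for p1, p2 in zip(model_once.parameters(), model_thrice.parameters())
)
print(max_diff)
```

In this toy setup the backbone gradient is the sum of the contributions from all three branches either way, so the two variants should agree up to floating-point error.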
How reproducible is the accuracy difference? I.e. are you always seeing this gap for e.g. 10 runs with different seeds?

I will have to check that!

The gap might of course just be due to “bad luck”, but if you are seeing a consistent difference, something else is going on that I’m missing.
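A rough sketch of such a multi-seed check is below. `train_and_evaluate` is a hypothetical stand-in that trains a tiny linear classifier on synthetic data purely to keep the example self-contained; in practice it would be replaced by the real training and evaluation loop for each model.

```python
import statistics
import torch
import torch.nn as nn


def train_and_evaluate(seed):
    """Hypothetical stand-in for one full training run; returns test accuracy."""
    torch.manual_seed(seed)
    # Synthetic two-class data, used only so the sketch runs on its own.
    x = torch.randn(256, 10)
    y = (x[:, 0] + 0.5 * torch.randn(256) > 0).long()
    x_train, y_train = x[:192], y[:192]
    x_test, y_test = x[192:], y[192:]

    model = nn.Linear(10, 2)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    criterion = nn.CrossEntropyLoss()
    for _ in range(50):
        optimizer.zero_grad()
        criterion(model(x_train), y_train).backward()
        optimizer.step()

    with torch.no_grad():
        preds = model(x_test).argmax(dim=1)
    return (preds == y_test).float().mean().item()


# Repeat the run with different seeds and summarize the spread.
accuracies = [train_and_evaluate(seed) for seed in range(10)]
mean_acc = statistics.mean(accuracies)
std_acc = statistics.stdev(accuracies)
print(f"accuracy over {len(accuracies)} seeds: {mean_acc:.3f} +/- {std_acc:.3f}")
```

Running this once per model gives a mean and spread for each. If the gap between the two means is comparable to the seed-to-seed standard deviation, it is plausibly just noise.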