Getting different accuracy using similar models but why?

Umair_Javaid · May 13, 2020, 12:22am

model1:

def forward(self, x, labels=None, return_cam=False):
        batch_size = x.shape[0]
        x = self.features(x)
        
        x1 = self.conv6(x)
        x1 = self.relu(x1)
        x1 = self.conv7(x1)
        x1 = self.relu(x1)
        
        x2 = self.conv8(x)
        x2 = self.relu(x2)
        x2 = self.conv9(x2)
        x2 = self.relu(x2)

        x3 = self.conv10(x)
        x3 = self.relu(x3)
        x3 = self.conv11(x3)
        x3 = self.relu(x3)

        x = x1 + x2 + x3

model2:

def forward(self, x, labels=None, return_cam=False):
        batch_size = x.shape[0]
        x1 = self.features(x)
        x1 = self.conv6(x1)
        x1 = self.relu(x1)
        x1 = self.conv7(x1)
        x1 = self.relu(x1)
        
        x2 = self.features(x)
        x2 = self.conv8(x2)
        x2 = self.relu(x2)
        x2 = self.conv9(x2)
        x2 = self.relu(x2)

        x3 = self.features(x)
        x3 = self.conv10(x3)
        x3 = self.relu(x3)
        x3 = self.conv11(x3)
        x3 = self.relu(x3)

        x = x1 + x2 + x3

self.features contains pretrained VGG layers.
I get slightly better accuracy with model2. Why is that?

ptrblck · May 13, 2020, 3:24am

If self.features contains e.g. batchnorm layers, model2 would update their running statistics three times per forward pass, which might improve the training accuracy.
Let me know if that’s not the case, and we can dig a bit deeper.
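To illustrate the batchnorm point, here is a minimal sketch. It assumes a small stand-in feature extractor (a conv layer followed by nn.BatchNorm2d), which is not the poster's VGG backbone, and it only shows that each extra forward pass in training mode updates the running statistics again.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-in for self.features; assumed to contain a batchnorm layer.
features = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),
    nn.BatchNorm2d(8),
)
features.train()  # running statistics are only updated in training mode
bn = features[1]

x = torch.randn(4, 3, 16, 16)

# One call per iteration, as in model1.
features(x)
updates_after_one_call = bn.num_batches_tracked.item()
mean_after_one_call = bn.running_mean.clone()

# Two more calls on the same input, giving three in total as in model2.
features(x)
features(x)
updates_after_three_calls = bn.num_batches_tracked.item()
mean_after_three_calls = bn.running_mean.clone()

print(updates_after_one_call, updates_after_three_calls)  # prints: 1 3
print(torch.allclose(mean_after_one_call, mean_after_three_calls))
```

With the default momentum, every call moves running_mean and running_var a bit further towards the batch statistics, so the two variants would end up with different buffers at evaluation time even though their outputs in training mode are identical.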

Umair_Javaid · May 13, 2020, 11:22am

There is no batch normalization. However, after blocks 3, 4, and 5, I take the sigmoid of the channel-wise mean of the feature maps and then multiply the feature maps by the result:

attention = torch.mean(input_, dim=1, keepdim=True)
importance_map = torch.sigmoid(attention)
return input_.mul(importance_map)

Could that be the reason?
Another question: why is self.features updated three times in model2 but only once in model1? Isn't the flow of gradients the same for both models?

ptrblck · May 14, 2020, 12:51am

The gradient flow should be the same. I was only referring to the updates of the internal batchnorm running statistics.

The additional operation using the importance_map might of course change the performance of the model.
If I understand the use case correctly, you are using this attention op in the second approach, but not in the first one?

Umair_Javaid · May 14, 2020, 1:45pm

No, I am using the attention op in both models. self.features is identical in both approaches.
Can you explain the differences between the two models?

ptrblck · May 15, 2020, 12:53am

Could you post the complete architecture, please?

ptrblck · May 15, 2020, 5:40am

Thanks for the code.
I’ve added myModel2 as:

class myModel2(nn.Module):
    def __init__(self, features, num_classes=200, **kwargs):
        super(myModel2, self).__init__()
        self.features = features
        self.conv6 = nn.Conv2d(512,  1024, kernel_size=3, padding=1) 
        self.conv7 = nn.Conv2d(1024, num_classes, kernel_size=1)
        self.conv8 = nn.Conv2d(512,  1024, kernel_size=3, padding=1) 
        self.conv9 = nn.Conv2d(1024, num_classes, kernel_size=1)
        self.conv10 = nn.Conv2d(512,  1024, kernel_size=3, padding=1) 
        self.conv11 = nn.Conv2d(1024, num_classes, kernel_size=1)
        self.conv12 = nn.Conv2d(512,  1024, kernel_size=3, padding=1) 
        self.conv13 = nn.Conv2d(1024, num_classes, kernel_size=1)
        self.relu = nn.ReLU(inplace=False)
        self.avgpool = nn.AdaptiveAvgPool2d(1)
        #self.fc = nn.Linear(1024, num_classes)
        initialize_weights(self.modules(), init_mode='he')
    def forward(self, x, labels=None, return_cam=False):
        batch_size = x.shape[0]
        x = self.features(x)
        
        x1 = self.conv6(x)
        x1 = self.relu(x1)
        x1 = self.conv7(x1)
        x1 = self.relu(x1)
        #x2 = self.features(x)
        x2 = self.conv8(x)
        x2 = self.relu(x2)
        x2 = self.conv9(x2)
        x2 = self.relu(x2)
        #x3 = self.features(x)
        x3 = self.conv10(x)
        x3 = self.relu(x3)
        x3 = self.conv11(x3)
        x3 = self.relu(x3)
        #x4 = self.features(x)
        x4 = self.conv12(x)
        x4 = self.relu(x4)
        x4 = self.conv13(x4)
        x4 = self.relu(x4)
        x = x1 + x2 + x3 + x4
        if return_cam:
            normalized_feature_map = normalize_tensor(x.detach().clone())
            cams = normalized_feature_map[range(batch_size), labels]
            return cams

        
        x = self.avgpool(x)
        x = x.view(x.size(0), -1)
        return {'logits': x}

and get exactly the same results using this code:

features = nn.Conv2d(1, 512, 1)
model1 = myModel7(features)
model2 = myModel2(features)
model2.load_state_dict(model1.state_dict())

x = torch.randn(1, 1, 4, 4)
out1 = model1(x)
out2 = model2(x)

print((out1['logits'] - out2['logits']).abs().max())
> tensor(0., grad_fn=<MaxBackward1>)

Umair_Javaid · May 15, 2020, 6:05am

I am sure the output is the same.

But after training both models on the CUB dataset, myModel6 gets a 2.1% increase in classification accuracy compared to the other model. I just want to know why that is. Are the weights being updated in the same way in both models? Are there any differences?

def forward(self, x, labels=None, return_cam=False):
        batch_size = x.shape[0]
        x1 = self.features(x)
        x1 = self.conv6(x1)
        x1 = self.relu(x1)
        x1 = self.conv7(x1)
        x1 = self.relu(x1)
        
        x2 = self.features(x)
        x2 = self.conv8(x2)
        x2 = self.relu(x2)
        x2 = self.conv9(x2)
        x2 = self.relu(x2)

        x3 = self.features(x)
        x3 = self.conv10(x3)
        x3 = self.relu(x3)
        x3 = self.conv11(x3)
        x3 = self.relu(x3)

        x = x1 + x2 + x3

(I am sorry if I am bothering you.)

ptrblck · May 15, 2020, 6:21am

The updates should be equal, as I also get exactly the same gradients for all parameters in these models.
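A gradient check of this kind could look roughly like the sketch below. The two tiny modules (SharedOnce, Recomputed) and their head_a/head_b layers are hypothetical stand-ins for the two forward variants, not the actual models from this thread, and double precision is used so that differences in summation order stay negligible.

```python
import torch
import torch.nn as nn


class SharedOnce(nn.Module):
    """Computes the shared features once and feeds them to both heads."""

    def __init__(self):
        super().__init__()
        self.features = nn.Conv2d(1, 4, kernel_size=1)
        self.head_a = nn.Conv2d(4, 3, kernel_size=1)
        self.head_b = nn.Conv2d(4, 3, kernel_size=1)
        self.relu = nn.ReLU()

    def forward(self, x):
        f = self.features(x)
        return self.relu(self.head_a(f)) + self.relu(self.head_b(f))


class Recomputed(SharedOnce):
    """Same parameters, but recomputes the shared features for each head."""

    def forward(self, x):
        out_a = self.relu(self.head_a(self.features(x)))
        out_b = self.relu(self.head_b(self.features(x)))
        return out_a + out_b


torch.manual_seed(0)
m1 = SharedOnce().double()
m2 = Recomputed().double()
m2.load_state_dict(m1.state_dict())  # identical weights in both variants

x = torch.randn(2, 1, 4, 4, dtype=torch.float64)
m1(x).sum().backward()
m2(x).sum().backward()

# Largest absolute gradient difference over all matching parameters.
max_grad_diff = max(
    (p1.grad - p2.grad).abs().max().item()
    for p1, p2 in zip(m1.parameters(), m2.parameters())
)
print(max_grad_diff)
```

In the recomputed variant autograd accumulates the gradient of self.features over both calls, which is mathematically the same sum that the shared variant backpropagates through the single call.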
How reproducible is the accuracy difference? I.e. are you always seeing this gap for e.g. 10 runs with different seeds?
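One way to organise such a check is sketched below. train_and_evaluate is a hypothetical callable that you would replace with your own training and evaluation loop. The stub used here only returns seed-dependent noise so that the snippet runs; its numbers are placeholders, not measurements.

```python
import random
import statistics


def compare_over_seeds(train_and_evaluate, seeds):
    """Run both variants once per seed and summarise the per-seed accuracy gap."""
    gaps = []
    for seed in seeds:
        acc1 = train_and_evaluate("model1", seed)
        acc2 = train_and_evaluate("model2", seed)
        gaps.append(acc2 - acc1)
    return statistics.mean(gaps), statistics.stdev(gaps)


def stub_train_and_evaluate(variant, seed):
    # Placeholder only: no training happens here. Replace this with a function
    # that seeds all RNGs, trains the given variant, and returns test accuracy.
    rng = random.Random(f"{variant}-{seed}")
    return 0.5 + rng.uniform(-0.01, 0.01)


mean_gap, std_gap = compare_over_seeds(stub_train_and_evaluate, seeds=range(10))
print(f"mean gap {mean_gap:+.4f}, std {std_gap:.4f}")
```

If the mean gap is small compared to its standard deviation across seeds, the difference is probably just noise.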

Umair_Javaid · May 15, 2020, 6:26am

I will have to check that!

ptrblck · May 15, 2020, 6:32am

The gap might of course just be due to “bad luck”, but if you are seeing a consistent difference, something else is going on that I’m missing.