Loss not decreasing despite reducing model complexity

I have been trying to replicate a paper and build the same model, but with a few changes: adding a non-linear contrastive loss (the lossless triplet loss) and better data augmentation. I have not been able to get past the 70% accuracy mark on the test set, and the test loss does not decrease despite 20+ epochs of training, with either the standard contrastive loss or the lossless triplet loss (a sketch of the latter is included after the contrastive loss below). I have been forced to use a learning rate <= 1e-5, as anything greater results in the model collapsing to predicting the mean output for any input. I have also implemented weight regularization and reduced the model complexity severely (from 250M+ to 4M+ parameters). The model seems to be overfitting: only the train loss goes down, while the test loss keeps increasing.
The Model:

import torch
import torch.nn as nn

class PhiNet(nn.Module):
    def __init__(self):
        super(PhiNet, self).__init__()
        self.layer1 = nn.Sequential(
                    nn.Conv2d(1,96,kernel_size=11,stride=1,padding=1),
                    nn.ReLU(),
                    nn.BatchNorm2d(96,eps=1e-06, momentum=0.9),
                    nn.MaxPool2d(kernel_size=3, stride=2))
  

        self.layer2 = nn.Sequential(
                    nn.Conv2d(96, 256, kernel_size=5, stride=1, padding=2),
                    nn.ReLU(),
                    nn.BatchNorm2d(256,eps=1e-06, momentum=0.9),
                    nn.MaxPool2d(kernel_size=3, stride=2),
                    nn.Dropout2d(p=0.3))
        
        self.layer3 = nn.Sequential(
                    nn.Conv2d(256,324, kernel_size=3, stride=1, padding=1),
                    nn.ReLU()
                    )
        
        self.layer4 = nn.Sequential(
                    nn.Conv2d(324,64, kernel_size=3, stride=1),
                    nn.ReLU(),
                    nn.MaxPool2d(kernel_size=3, stride=2),
                    nn.Dropout2d(p=0.3))
        
        self.layer5 = nn.Sequential(
                      nn.Conv2d(64,32, kernel_size=3, stride=1),
                      nn.ReLU(),
                      nn.MaxPool2d(kernel_size=3, stride=2),
                      nn.Dropout2d(p=0.3))

        
        self.layer6 = nn.Sequential(
                    nn.Linear(2880,1024),
                    nn.ReLU(),
                    nn.Dropout(p=0.6))
        
        self.layer7 = nn.Sequential(
            nn.Linear(1024,128),
            nn.Sigmoid())
        
        for m in self.modules():
          if isinstance(m, nn.Conv2d):
              nn.init.kaiming_normal_(m.weight, mode='fan_in')
               
    def forward(self, x):
        out = self.layer1(x)
        #print (out.size())
        out = self.layer2(out)
        out = self.layer3(out)
        out = self.layer4(out)
        out = self.layer5(out)

        out = out.reshape(out.size()[0], -1)
        #FC
        out = self.layer6(out)
        out = self.layer7(out)

        
        return out
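
As a quick sanity check, here is a minimal smoke test of the forward pass (a sketch; the 1x300x150 grayscale input size is taken from the model summary below):

import torch

# Smoke test: a forward pass with dummy 1x300x150 grayscale inputs should
# produce 128-dimensional embeddings, matching the summary below.
phinet = PhiNet()
x = torch.randn(2, 1, 300, 150)  # batch of 2 dummy images
emb = phinet(x)
print(emb.shape)  # expected: torch.Size([2, 128])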

Loss:

import torch
import torch.nn.functional as F

class ContrastiveLoss(torch.nn.Module):
    """
    Contrastive loss function.
    Based on: http://yann.lecun.com/exdb/publis/pdf/hadsell-chopra-lecun-06.pdf
    """

    def __init__(self, margin=2.0):
        super(ContrastiveLoss, self).__init__()
        self.margin = margin
        self.eps=1e-5

    def forward(self, output1, output2, label):
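        # label convention: 0 = similar pair, 1 = dissimilar pair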
        euclidean_distance = F.pairwise_distance(output1, output2)
        loss_contrastive = torch.mean((1-label) * torch.pow(euclidean_distance, 2) +
                                      (label) * torch.pow(torch.clamp(self.margin - euclidean_distance, min=0.0), 2))


        return loss_contrastive
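
Since the lossless triplet loss keeps coming up, here is the non-linear variant I am referring to (a minimal sketch following the usual formulation; the class name and the assumption that embeddings are sigmoid-bounded with N = 128 are mine, not from the paper):

import torch

# Minimal sketch of the "lossless" (non-linear) triplet loss mentioned above.
# Assumptions (not from the original post): embeddings come from the Sigmoid
# head, so each coordinate lies in [0, 1] and the squared distance lies in
# [0, N]; beta is conventionally set to N, the embedding dimensionality.
class LosslessTripletLoss(torch.nn.Module):
    def __init__(self, dim=128, beta=None, eps=1e-8):
        super(LosslessTripletLoss, self).__init__()
        self.dim = dim
        self.beta = beta if beta is not None else dim
        self.eps = eps

    def forward(self, anchor, positive, negative):
        d_pos = torch.sum((anchor - positive) ** 2, dim=1)  # squared L2 distance
        d_neg = torch.sum((anchor - negative) ** 2, dim=1)
        # Non-linear mapping -ln(-x/beta + 1 + eps) replaces the hard margin
        pos_term = -torch.log(-d_pos / self.beta + 1 + self.eps)
        neg_term = -torch.log(-(self.dim - d_neg) / self.beta + 1 + self.eps)
        return torch.mean(pos_term + neg_term)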

Model Summary:

from torchsummary import summary
summary(phinet, (1,300, 150))
----------------------------------------------------------------
        Layer (type)               Output Shape         Param #
================================================================
            Conv2d-1         [-1, 96, 292, 142]          11,712
              ReLU-2         [-1, 96, 292, 142]               0
       BatchNorm2d-3         [-1, 96, 292, 142]             192
         MaxPool2d-4          [-1, 96, 145, 70]               0
            Conv2d-5         [-1, 256, 145, 70]         614,656
              ReLU-6         [-1, 256, 145, 70]               0
       BatchNorm2d-7         [-1, 256, 145, 70]             512
         MaxPool2d-8          [-1, 256, 72, 34]               0
         Dropout2d-9          [-1, 256, 72, 34]               0
           Conv2d-10          [-1, 324, 72, 34]         746,820
             ReLU-11          [-1, 324, 72, 34]               0
           Conv2d-12           [-1, 64, 70, 32]         186,688
             ReLU-13           [-1, 64, 70, 32]               0
        MaxPool2d-14           [-1, 64, 34, 15]               0
        Dropout2d-15           [-1, 64, 34, 15]               0
           Conv2d-16           [-1, 32, 32, 13]          18,464
             ReLU-17           [-1, 32, 32, 13]               0
        MaxPool2d-18            [-1, 32, 15, 6]               0
        Dropout2d-19            [-1, 32, 15, 6]               0
           Linear-20                 [-1, 1024]       2,950,144
             ReLU-21                 [-1, 1024]               0
          Dropout-22                 [-1, 1024]               0
           Linear-23                  [-1, 128]         131,200
          Sigmoid-24                  [-1, 128]               0
================================================================
Total params: 4,660,388
Trainable params: 4,660,388
Non-trainable params: 0
----------------------------------------------------------------
Input size (MB): 0.17
Forward/backward pass size (MB): 182.64
Params size (MB): 17.78
Estimated Total Size (MB): 200.59
----------------------------------------------------------------

And here is the training log:

Epoch: 0
[0.25258303 0.18019645 0.33193994 0.22606815 0.4684852 ]
Accuracy: 62.989  Threshold: 0.260
Saving..
Train Loss: 1.0204066408177217
Test Loss:  1.2720871403813363

Epoch: 1
[1.0923935  0.52174014 0.58343613 0.5127276  0.09465909]
Accuracy: 60.300  Threshold: 0.390
Saving..
Train Loss: 0.7175934920708339
Test Loss:  1.2123102966696024

Epoch: 2
[0.20261455 0.3708717  0.37443015 1.8816589  2.446019  ]
Accuracy: 62.704  Threshold: 0.450
Saving..
Train Loss: 0.5991284149698913
Test Loss:  1.0973380882292987

Epoch: 3
[2.9831722  0.46837586 0.86912566 0.6062222  0.26007086]
Accuracy: 62.067  Threshold: 0.320
Train Loss: 0.5078814978090426
Test Loss:  1.1578331850469112
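
For context, the Accuracy/Threshold pairs above are picked by sweeping a distance threshold over the test pairs, roughly like this (a sketch; `distances`, `labels`, and `best_threshold` are placeholder names, not the actual evaluation code):

import numpy as np

# Sketch of a threshold sweep that could produce the Accuracy/Threshold lines
# above. `distances` (pairwise embedding distances on the test pairs) and
# `labels` (0 = similar, 1 = dissimilar) are assumed placeholder arrays.
def best_threshold(distances, labels, step=0.01):
    best_acc, best_t = 0.0, 0.0
    for t in np.arange(distances.min(), distances.max(), step):
        preds = (distances > t).astype(int)  # beyond the threshold -> dissimilar
        acc = (preds == labels).mean()
        if acc > best_acc:
            best_acc, best_t = acc, float(t)
    return best_acc, best_t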

I would really appreciate any kind of input. Thank you!

@ptrblck @smth I have even changed the model, but the accuracy still hovers around the 60-70% mark. Any help would be appreciated.

Are you sure you're not just stuck in a local minimum? Have you played with the learning rate?