Model predicts correct values when used first time but eventually predictions get worse

Hello everyone! I have a problem with a resnet50 classifier that I trained. Its training and validation losses looked fine, and I can say with high confidence that it isn’t overfitting.

Epoch : 050, Training: Loss: 0.9120, Accuracy: 100.0000%, 
		Validation : Loss : 0.9948, Accuracy: 92.9577%

Also, it gave all the correct predictions when I used it for inference for the first time, but after that it started giving me wrong predictions. I’ll admit I forgot to call model.eval() at first, but the prediction was done under “with torch.no_grad():”, so I don’t think it messed up the gradients. Can someone please tell me what’s happening and how I can fix it?

P.S. Images are RGB

Code for model loading:

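import torch
import torch.nn as nn
from torchvision import models

class ClfNetwork(nn.Module):
    def __init__(self):
        super(ClfNetwork, self).__init__()
        # ResNet-50 backbone created without pretrained weights
        self.model = models.resnet50(pretrained = False)
        # Set requires_grad = False on all backbone parameters
        for param in self.model.parameters():
            param.requires_grad = False
        # Replace the head: 5 classes, softmax over the class dimension
        self.model.fc = nn.Sequential(nn.Linear(2048, 5),
                               nn.Softmax(dim = 1))
    def forward(self, x):
        y = self.model(x)
        return y
clf = ClfNetwork().to(DEVICE)
clf.load_state_dict(torch.load('path_to_model.pth')) # <All keys matched successfully>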

Transforms: (I used albumentations)

ctfms = A.Compose([
    A.Resize(D, D), # D is 224
    A.Normalize(mean = 0.0, std = 1.0, max_pixel_value = 255.0),

Prediction Function:

def predict_class(model, transform, Id):
    test_image = io.imread(f'path_to_image_{Id}.tiff')
    test_image_tensor = transform(image = test_image)["image"]
    test_image_tensor = test_image_tensor.view(1, 3, 224, 224).to(DEVICE) # DEVICE is GPU
    with torch.no_grad():
        out = model(test_image_tensor)
        ps = torch.exp(out)
        predicted_class = ps.argmax()
    return predicted_class

Do let me know if there are some other details that I need to share. Thanks!

The losses and accuracies show a noticeable gap between training and validation, so I would claim your model is overfitting to the training dataset. In particular, a training accuracy of 100% doesn’t sound right next to a validation accuracy of about 93%.

That’s correct. Using torch.no_grad() will make sure that no gradients are calculated. However, performing the forward passes in model.train() mode will e.g. update the running stats of the batchnorm layers with the statistics of your test inputs, and I would thus consider it a data leak.
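A minimal sketch of that effect, assuming a tiny stand-in model (one conv layer plus one batchnorm layer) instead of your actual ResNet-50. The names model, bn and x are made up for illustration and the input is random data, so treat it as a demonstration of the mechanism rather than a reproduction of your setup:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Tiny stand-in for the real network: one conv layer followed by batchnorm.
model = nn.Sequential(nn.Conv2d(3, 4, kernel_size=3), nn.BatchNorm2d(4))
bn = model[1]
x = torch.rand(8, 3, 16, 16)  # hypothetical batch of random "images"

# train() mode under no_grad: no gradients are computed,
# but batchnorm still updates its running statistics.
model.train()
mean_before = bn.running_mean.clone()
with torch.no_grad():
    model(x)
stats_changed_in_train = not torch.equal(mean_before, bn.running_mean)

# eval() mode: batchnorm uses the stored statistics and leaves them untouched.
model.eval()
mean_before = bn.running_mean.clone()
with torch.no_grad():
    out_1 = model(x)
    out_2 = model(x)
stats_changed_in_eval = not torch.equal(mean_before, bn.running_mean)
outputs_identical_in_eval = torch.equal(out_1, out_2)

print(stats_changed_in_train)
print(stats_changed_in_eval)
print(outputs_identical_in_eval)
```

So every forward pass in train() mode shifts the batchnorm statistics a bit further away from what was stored in the checkpoint, even inside torch.no_grad(). Calling model.eval() once after load_state_dict and before predicting avoids this.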

I see. So there is overfitting. But I still don’t understand how my model was able to work correctly the very first time and then did not work at all after that. What solution do you propose to overcome this problem? Could you also tell me more about the data leak? This is the first time I am encountering this issue.