Model accuracy increases gradually in `model.eval()`

I have modified the fc layer in a resnet18 model to output 4 logits for a classification problem. To test the model I am using the following function.

def test_model(model, test_loader, train=False):
    with torch.no_grad():
        # switch mode before the evaluation loop
        model.train() if train else model.eval()
        test_corrects = 0
        for inps, labs in test_loader:
            inps, labs = inps.to(DEVICE), labs.to(DEVICE)
            outs = model(inps)
            prds = outs.argmax(dim=1)          # predicted class per sample
            test_corrects += (prds == labs).sum().item()
    print(f'model.training = {str(model.training):>5},  '
          f'test_acc = {test_corrects / len(test_loader.dataset):0.6f}')

The train argument in the function decides if model.train() is used or model.eval(). It is expected that model.eval() should always produce the same value of accuracy. However, if I run the following piece of code:

for i in range(10):
    test_model(net, test_loader)
    test_model(net, test_loader, train=True)

i.e. alternating between eval and train mode, the accuracy in train mode stays the same every time, but the accuracy in eval mode gradually increases until it reaches a value, after which there is no further increase. This gradual saturation of accuracy is not observed if the train-mode call is removed from the for loop. This suggests that the forward passes in train mode change the model in a way that affects eval mode as well, which is really perplexing.

I suspect it might be wrong for the test_model function to call model.train() inside torch.no_grad(), but I am still not sure how that affects the model's performance in eval mode.

This could be expected if you are using e.g. batchnorm layers, which update their running stats in each forward pass in training mode. eval() mode would then potentially return different outputs until the running stats have converged.
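A minimal sketch of this effect, using a standalone nn.BatchNorm1d layer (the layer and batch here are illustrative, not taken from the model above): the running stats are untouched by forward passes in eval mode, but are updated by forward passes in train mode even inside torch.no_grad(), and converge toward the batch statistics after enough passes.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
bn = nn.BatchNorm1d(3)        # running_mean starts at zeros, running_var at ones
x = torch.randn(8, 3) + 5.0   # batch whose mean (~5) is far from the initial stats

with torch.no_grad():
    bn.eval()
    before = bn.running_mean.clone()
    bn(x)                                          # eval mode: stats NOT updated
    assert torch.equal(bn.running_mean, before)

    bn.train()
    bn(x)                                          # train mode: stats ARE updated,
    assert not torch.equal(bn.running_mean, before)  # even under no_grad

    # repeated train-mode passes move running_mean toward the batch mean;
    # this convergence is why the eval accuracy "saturates" after a while
    for _ in range(200):
        bn(x)

print(bn.running_mean)  # now very close to x.mean(0)
```

no_grad() only disables autograd bookkeeping; it does not freeze the buffers that train mode tells batchnorm to update, which is exactly the interaction observed in the loop above.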

My resnet18 backbone contains batchnorm layers (the Dropout I added in fc keeps no running stats), so that must be the case. Thank you!
I suppose then that the correct accuracy for the model is the one we get from running test_model once in eval mode, and not the one obtained after running the for loop above. Also, is there any case where you would have to use train mode inside torch.no_grad()?
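On the last question: one legitimate case is Monte Carlo dropout, where dropout is deliberately kept active at inference time to estimate predictive uncertainty; no gradients are needed, so it runs under torch.no_grad(). A hedged sketch (the small network, p=0.5, and 100 samples are illustrative choices, not from the thread above) that puts only the Dropout submodules in train mode, so any batchnorm layers keep using their frozen running stats:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
net = nn.Sequential(nn.Linear(10, 64), nn.ReLU(),
                    nn.Dropout(p=0.5), nn.Linear(64, 4))
x = torch.randn(1, 10)

# eval() everywhere first, then re-enable train mode on Dropout layers only,
# so batchnorm running stats (if present) are neither used wrongly nor updated
net.eval()
for m in net.modules():
    if isinstance(m, nn.Dropout):
        m.train()

with torch.no_grad():
    # each forward pass samples a different dropout mask
    samples = torch.stack([net(x).softmax(dim=1) for _ in range(100)])

mean_probs = samples.mean(dim=0)  # averaged prediction over masks
std_probs = samples.std(dim=0)    # spread = per-class uncertainty estimate
```

Selectively flipping submodules like this avoids the pitfall from this thread: a blanket model.train() inside no_grad() would silently update batchnorm running stats on every forward pass.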