Training eval() vs Testing eval() results differ

I am trying to replicate the results I get on my evaluation dataset during training with the same dataset at test time. For context, I am denoising images, and they are passed through a simple NN consisting of a few conv, ReLU, and BatchNorm layers.
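The network is roughly like this (a simplified sketch, not my exact architecture; the layer count and channel width here are just illustrative):

```python
import torch
import torch.nn as nn

class DenoiseNet(nn.Module):
    """Rough sketch of the denoiser: a conv / ReLU / BatchNorm stack."""

    def __init__(self, channels=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(3, channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, 3, kernel_size=3, padding=1),
        )

    def forward(self, x):
        # Predict the noise residual and subtract it from the noisy input
        return x - self.body(x)
```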

My problem is that the results are quite close, but there is always some disparity. For instance, in training the model gets 37.49 dB and in testing it gets 37.40 dB.

I have done a bit of reading, and it seems this has to do with the BatchNorm layers in the model, but setting track_running_stats to False, as some people suggest, hasn't helped.
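For reference, the way I toggle it is roughly this (a simplified sketch; as far as I can tell, an already-built layer in eval() keeps using its stored running buffers unless they are cleared, so I clear those too, though I'm not certain about the internals):

```python
import torch.nn as nn

model = DenoiseNet()

# Force every BatchNorm2d to normalize with batch statistics even in eval().
# Flipping track_running_stats alone may not be enough on a constructed layer;
# the stored running_mean / running_var buffers apparently need clearing too.
for module in model.modules():
    if isinstance(module, nn.BatchNorm2d):
        module.track_running_stats = False
        module.running_mean = None
        module.running_var = None
```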

I also make sure to call model.eval() on my model and to run inference under with torch.no_grad().
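My test loop is essentially the following (simplified; test_loader and psnr are placeholder names for my dataloader and metric, and the clamp assumes images in [0, 1]):

```python
import torch

model.eval()  # switch BatchNorm to inference behaviour
psnr_total, n = 0.0, 0
with torch.no_grad():  # disable autograd bookkeeping for inference
    for noisy, clean in test_loader:
        output = model(noisy).clamp(0.0, 1.0)  # assumes [0, 1] image range
        psnr_total += psnr(output, clean)      # psnr() sketched further down
        n += 1
print(f"Average PSNR: {psnr_total / n:.2f} dB")
```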

The only method I have found that brings the PSNR close is to test with a small minibatch (3 images), but this is not practical, as the end goal is to process images separately. I suspect this difference might also come from the PSNR function itself: computing one PSNR from the MSE of a whole batch is not the same as averaging per-image PSNR values.
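To illustrate what I mean, here are the two ways the metric could be computed (hypothetical helper names, assuming images in [0, 1]); since log10 of a mean differs from the mean of log10s, these generally give different numbers for the same outputs:

```python
import torch

def psnr_per_image(output, target, max_val=1.0):
    # One MSE per image, one PSNR per image, then the mean of those PSNRs.
    # This matches processing images separately at test time.
    mse = ((output - target) ** 2).mean(dim=(1, 2, 3))
    return (10 * torch.log10(max_val ** 2 / mse)).mean()

def psnr_per_batch(output, target, max_val=1.0):
    # A single MSE over every pixel in the batch, then one PSNR.
    # Because log10 is concave, this generally differs from the mean
    # of per-image PSNRs (mean of logs != log of mean).
    mse = ((output - target) ** 2).mean()
    return 10 * torch.log10(max_val ** 2 / mse)
```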