Loaded model gives much worse accuracy than trianed model

Hi all,

I am trying to test an image denoising model with the same validation set that was used in training but my results are much worse than in training. During training the validation dataset gave me 37 dB and after loading the model, I get only 25 dB.

I have used this testing script before and it usually works as expected. I do the usual stuff of setting the model to eval() and do: with no grad:

The only notable difference with this model is that I have implemented a custom batchnorm module from a paper I read. The module is as follows:

class BFBatchNorm2d(nn.BatchNorm2d):
    def __init__(self, num_features, eps=1e-5, momentum=0.1, use_bias = False, affine=True):
        super(BFBatchNorm2d, self).__init__(num_features, eps, momentum)
        self.use_bias = use_bias
        # self.training = True
        # self.affine=False
    def forward(self, x):
        y = x.transpose(0,1)
        return_shape = y.shape
        y = y.contiguous().view(x.size(1), -1)
        if self.use_bias:
            mu = y.mean(dim=1)
        sigma2 = y.var(dim=1)

        if self.training is not True:
            if self.use_bias:
                y = y - self.running_mean.view(-1, 1)
            y = y / ( self.running_var.view(-1, 1)**0.5 + self.eps)
            if self.track_running_stats is True:
                with torch.no_grad():
                    if self.use_bias:
                        self.running_mean = (1-self.momentum)*self.running_mean + self.momentum * mu
                    self.running_var = (1-self.momentum)*self.running_var + self.momentum * sigma2
            if self.use_bias:
                y = y - mu.view(-1,1)
            y = y / (sigma2.view(-1,1)**.5 + self.eps)

        if self.affine:
            y = self.weight.view(-1, 1) * y;
            if self.use_bias:
                y += self.bias.view(-1, 1)

        return y.view(return_shape).transpose(0,1)

I have checked and confirmed that the .ckpt file contains the weights, biases, runnning mean and running variance of every layer that employs this BN layer. It seems to me that the weights are being correctly loaded. I have also made sure data paths have not been mixed up and dataloaders are identical.

Another detail is that this model was trained in Lightning but I have tried loading the without any lightning modules, and defined the model as a normal python module, and my output PSNR is the same.

Any ideas would be appreciated because I have exhausted all possibilities.