Model performance worsened by model.eval()

[edit] I have found a plausible solution to my issue in the following thread: Model.eval() gives incorrect loss for model with batchnorm layers

There, ptrblck says:
" The high validation loss is due to the wrong estimates of the running stats.
Since you are feeding a constant tensor ( batchone : mean=1, std=0) and a random tensor ( batchtwo : mean~=0, std~=1), the running estimates will be shaky and wrong for both inputs.

During training the current batch stats will be used to compute the output, so that the model might converge.
However, during evaluation the batchnorm layer tries to normalize both inputs with skewed running estimates, which yields the high loss values.
Usually we assume that all inputs are from the same domain and thus have approx. the same statistics.

If you set track_running_stats=False in your BatchNorm layer, the batch statistics will also be used during evaluation, which will reduce the eval loss significantly."
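
To see what this means in isolation, here is a minimal sketch of the quoted scenario (a toy BatchNorm1d fed two batches with very different statistics, not my actual model):

```python
import torch
import torch.nn as nn

# Toy reproduction of the quoted scenario: one BatchNorm layer fed two
# "domains" with very different statistics.
torch.manual_seed(0)
bn = nn.BatchNorm1d(10)           # track_running_stats=True by default

batchone = torch.ones(8, 10)      # constant tensor: mean=1, std=0
batchtwo = torch.randn(8, 10)     # random tensor:   mean~=0, std~=1

# Train mode: each batch is normalized with its own statistics, while the
# running estimates drift toward a mix of the two domains.
bn.train()
for _ in range(100):
    bn(batchone)
    bn(batchtwo)

# Eval mode: both batches are now normalized with those mixed running
# estimates, so the outputs no longer look like what training produced.
bn.eval()
with torch.no_grad():
    print(bn(batchone).mean().item(), bn(batchone).std().item())
    print(bn(batchtwo).mean().item(), bn(batchtwo).std().item())
```

If the layer is instead created with track_running_stats=False, the last two prints show the per-batch normalization again, which is presumably why the suggestion reduces the eval loss.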

I am trying to train an ESRGAN model, and for debugging purposes, I print the discriminator’s predictions on both training and validation examples at regular intervals. I have had issues with my discriminator not training properly, so in an attempt to verify my implementation, I have been trying to train my network on copies of just a single image. In that case, it should be pretty simple for the discriminator to do its job well. Indeed, I find that the discriminator does a good job during training, but curiously, its predictions are bad during validation. My validation set consists of copies of the same image as in the training set, so there really should be no significant difference in performance during validation. (In principle I could have dropped the validation code entirely, but I did not bother to comment it out.)

To clarify what I mean by good and bad performance: in my latest run, a typical prediction during training was +10 for a real example and -10 for a fake example, whereas during validation, a typical prediction was -4 for a real example and -12 for a fake example. Large positive predictions for real examples and large negative predictions for fake examples are desired.

As far as I can tell, the only difference between my training process and my validation process is that I call model.train() before training and model.eval() before validation. Is it reasonable that calling model.eval() can cause such drastic changes in model performance? (Remember that both training and validation are done using copies of the same images.)
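
The kind of sanity check I have in mind looks like the sketch below (a small stand-in model with the same layer types, since the real discriminator is only printed further down): score the exact same batch once in train mode and once in eval mode, so that the mode switch is the only thing that changes.

```python
import torch
import torch.nn as nn

# Stand-in model with the same layer types as my discriminator
# (Conv2d + BatchNorm2d + LeakyReLU); the real architecture is shown below.
torch.manual_seed(0)
model = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),
    nn.BatchNorm2d(8),
    nn.LeakyReLU(0.2),
    nn.Flatten(),
    nn.Linear(8 * 16 * 16, 1),
)

batch = torch.randn(4, 3, 16, 16)

with torch.no_grad():
    model.train()
    out_train = model(batch)   # BatchNorm normalizes with this batch's statistics
    model.eval()
    out_eval = model(batch)    # BatchNorm normalizes with its running estimates

# Any difference between the two outputs is caused purely by the
# train/eval switch, i.e. by BatchNorm (there is no Dropout here).
print((out_train - out_eval).abs().max().item())
```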

I understand that the effect of model.eval() depends on which layers are used in the model, so here is my discriminator architecture:

```
VGG128Discriminator(
  (features): Sequential(
    (0): Conv2dBlock(
      (block): Sequential(
        (0): Conv2d(3, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
        (1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (2): LeakyReLU(negative_slope=0.2)
        (3): Conv2d(128, 128, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1))
        (4): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (5): LeakyReLU(negative_slope=0.2)
      )
    )
    (1): Conv2dBlock(
      (block): Sequential(
        (0): Conv2d(128, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
        (1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (2): LeakyReLU(negative_slope=0.2)
        (3): Conv2d(256, 256, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1))
        (4): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (5): LeakyReLU(negative_slope=0.2)
      )
    )
    (2): Conv2dBlock(
      (block): Sequential(
        (0): Conv2d(256, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
        (1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (2): LeakyReLU(negative_slope=0.2)
        (3): Conv2d(512, 512, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1))
        (4): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (5): LeakyReLU(negative_slope=0.2)
      )
    )
    (3): Conv2dBlock(
      (block): Sequential(
        (0): Conv2d(512, 1024, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
        (1): BatchNorm2d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (2): LeakyReLU(negative_slope=0.2)
        (3): Conv2d(1024, 1024, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1))
        (4): BatchNorm2d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (5): LeakyReLU(negative_slope=0.2)
      )
    )
    (4): Conv2dBlock(
      (block): Sequential(
        (0): Conv2d(1024, 1024, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
        (1): BatchNorm2d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (2): LeakyReLU(negative_slope=0.2)
        (3): Conv2d(1024, 1024, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1))
        (4): BatchNorm2d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (5): LeakyReLU(negative_slope=0.2)
      )
    )
  )
  (classifier): Sequential(
    (0): Linear(in_features=16384, out_features=100, bias=True)
    (1): LeakyReLU(negative_slope=0.2)
    (2): Linear(in_features=100, out_features=1, bias=True)
  )
)
```
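
In case it helps anyone else: if I understand the quoted suggestion correctly, one way to apply it to an already-built model is to switch every BatchNorm2d layer over to batch statistics. The helper below is hypothetical (use_batch_stats_in_eval and discriminator are just illustrative names), sketched under the assumption that clearing the running buffers makes the layers fall back to batch statistics even after model.eval():

```python
import torch.nn as nn

def use_batch_stats_in_eval(model: nn.Module) -> None:
    """Make every BatchNorm2d layer normalize with batch statistics, even in
    eval mode, as suggested in the quoted answer. With the running buffers
    set to None, the layers fall back to batch statistics after model.eval()."""
    for module in model.modules():
        if isinstance(module, nn.BatchNorm2d):
            module.track_running_stats = False
            module.running_mean = None
            module.running_var = None
            module.num_batches_tracked = None

# Illustrative usage:
# discriminator = VGG128Discriminator(...)
# use_batch_stats_in_eval(discriminator)
```

A cruder alternative would be to call .train() on just the BatchNorm2d layers after model.eval(), which also keeps batch statistics in use during validation, at the cost of continuing to update the running estimates there.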