Trained ResNet doesn't work in eval mode, behaves strangely

Thanks for the update. This artifact might then be caused by samples coming from different distributions, even within the training data. It might be similar to this behavior, i.e. the batchnorm running stats would converge towards the mean over all samples, while your dataset samples might be coming from distributions with different means.
During training the activations are normalized using the stats of the current batch, so each batch is normalized properly on its own. In eval mode the running stats are used instead, and these would be off for samples coming from any single one of these distributions.
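To make the effect concrete, here is a small sketch in plain Python (no PyTorch needed) that mimics the batchnorm bookkeeping for a single feature. It is not your model: the two sources with means -5 and +5, the helper names `make_batch` and `normalize`, and the assumption that each batch contains samples from only one source are all made up for illustration. The update rule follows the one documented for `nn.BatchNorm*d` with the default `momentum=0.1`.

```python
import statistics

MOMENTUM = 0.1  # default momentum of torch.nn.BatchNorm*d
EPS = 1e-5      # default eps of torch.nn.BatchNorm*d


def make_batch(center):
    # Hypothetical single-feature batch: fixed offsets around the source mean.
    return [center + d for d in (-1.0, -0.5, 0.0, 0.5, 1.0)]


def normalize(batch, mean, var):
    # Core batchnorm transform, without the affine weight and bias.
    return [(x - mean) / (var + EPS) ** 0.5 for x in batch]


running_mean, running_var = 0.0, 1.0  # initial values of the running stats
train_out_means = []

for step in range(200):
    # Assumption: batches alternate between two sources with different means.
    center = -5.0 if step % 2 == 0 else 5.0
    batch = make_batch(center)

    # Training mode: normalize with the stats of the current batch
    # (biased variance is used for the normalization itself).
    batch_mean = statistics.fmean(batch)
    train_out = normalize(batch, batch_mean, statistics.pvariance(batch))
    train_out_means.append(statistics.fmean(train_out))

    # Running stats: exponential moving average over batches
    # (the unbiased variance goes into the running estimate).
    running_mean = (1 - MOMENTUM) * running_mean + MOMENTUM * batch_mean
    running_var = (1 - MOMENTUM) * running_var + MOMENTUM * statistics.variance(batch)

# Eval mode: the same kind of batch is now normalized with the running stats.
eval_out = normalize(make_batch(-5.0), running_mean, running_var)
eval_out_mean = statistics.fmean(eval_out)

print("max |train output mean|:", max(abs(m) for m in train_out_means))
print("running_mean:", running_mean, "running_var:", running_var)
print("eval output mean:", eval_out_mean)
```

In this toy setup every training batch comes out centered at zero, the running mean settles close to zero (between the two source means), and the eval-mode output for a batch from the first source ends up far away from zero. If something like this is happening in your model, comparing `bn.running_mean` against the per-batch means of the layer input for a few batches should show a similar gap.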