Performance highly degraded when eval() is activated in the test phase

Yes, I’m sure: in training mode the input activation is normalized using its own batch stats, while the running stats are updated from these calculated batch stats and the momentum. These two statements don’t contradict each other.
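
As a quick sketch of that update rule (the variable names here are just for illustration; momentum defaults to 0.1 for nn.BatchNorm2d, and the variance update uses the unbiased batch variance):

def ema_update(running_stat, batch_stat, momentum=0.1):
    # exponential moving average used for running_mean and running_var
    return (1 - momentum) * running_stat + momentum * batch_stat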

Here is an example which shows that the input activation is normalized using its own batch stats instead of the running stats:

import torch
import torch.nn as nn

# create input with a defined mean and std
mean = 5.
std = 10.
x = torch.randn(10, 3, 224, 224) * std + mean

print('mean {}, std {}'.format(x.mean([0, 2, 3]), x.std([0, 2, 3])))
# > mean tensor([5.0125, 5.0295, 4.9645]), std tensor([ 9.9943, 10.0157,  9.9935])

# apply bn in training mode
bn = nn.BatchNorm2d(3)

print('running_mean {}, running_var {}'.format(bn.running_mean, bn.running_var))
# > running_mean tensor([0., 0., 0.]), running_var tensor([1., 1., 1.])

bn.train()

# normalize input activation using input stats and update running stats
output = bn(x)
print('mean {}, std {}'.format(output.mean([0, 2, 3]), output.std([0, 2, 3])))
# > mean tensor([-3.2676e-08, -5.8388e-09,  8.8647e-09], grad_fn=<MeanBackward1>), std tensor([1.0000, 1.0000, 1.0000], grad_fn=<StdBackward>)

print('running_mean {}, running_var {}'.format(bn.running_mean, bn.running_var))
# > running_mean tensor([0.5013, 0.5029, 0.4964]), running_var tensor([10.8887, 10.9315, 10.8870])
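
As a sanity check (continuing the snippet above), the printed running stats should match an exponential moving average of the initial stats and the batch stats, where the variance term uses the unbiased batch variance:

# verify the running stats against the update rule
momentum = bn.momentum  # 0.1 by default
expected_mean = (1 - momentum) * 0. + momentum * x.mean([0, 2, 3])
expected_var = (1 - momentum) * 1. + momentum * x.var([0, 2, 3], unbiased=True)
print('expected running_mean {}, expected running_var {}'.format(expected_mean, expected_var))
# these should approximately match the running stats printed above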

If the running stats had been used during training, the output tensor would not have been normalized at all: the initial running stats contain a zero mean and a unit variance, so the output would still show roughly the original mean of 5 and std of 10.
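
To tie this back to the original issue (again just a sketch, continuing the snippet above): once bn.eval() is called, the running stats are used instead of the batch stats, so if they are not yet a good estimate of the real activation stats (e.g., after only a few updates or with very small batches), the output will no longer be properly normalized, which is one common reason for the degraded performance in the test phase:

# switch to eval mode: the running stats (not the batch stats) are now used
bn.eval()
output_eval = bn(x)
print('mean {}, std {}'.format(output_eval.mean([0, 2, 3]), output_eval.std([0, 2, 3])))
# with the running stats from the single update above, the output is no longer
# normalized (roughly a mean of ~1.4 and a std of ~3 instead of 0 and 1)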