Yes, I’m sure: during training the input activation is normalized using its own batch statistics, while at the same time the running stats are updated from these calculated batch stats and the momentum. I don’t think these statements contradict each other.
Here is an example which shows that the input activation is normalized using its own stats instead of the running stats:
import torch
import torch.nn as nn

# create input with a defined mean and std
mean = 5.
std = 10.
x = torch.randn(10, 3, 224, 224) * std + mean
print('mean {}, std {}'.format(x.mean([0, 2, 3]), x.std([0, 2, 3])))
# > mean tensor([5.0125, 5.0295, 4.9645]), std tensor([ 9.9943, 10.0157, 9.9935])
# apply bn in training mode
bn = nn.BatchNorm2d(3)
print('running_mean {}, running_var {}'.format(bn.running_mean, bn.running_var))
# > running_mean tensor([0., 0., 0.]), running_var tensor([1., 1., 1.])
bn.train()  # modules start in training mode by default; this just makes it explicit
# normalize input activation using input stats and update running stats
output = bn(x)
print('mean {}, std {}'.format(output.mean([0, 2, 3]), output.std([0, 2, 3])))
# > mean tensor([-3.2676e-08, -5.8388e-09, 8.8647e-09], grad_fn=<MeanBackward1>), std tensor([1.0000, 1.0000, 1.0000], grad_fn=<StdBackward>)
print('running_mean {}, running_var {}'.format(bn.running_mean, bn.running_var))
# > running_mean tensor([0.5013, 0.5029, 0.4964]), running_var tensor([10.8887, 10.9315, 10.8870])
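The updated running stats also match the usual exponential-moving-average update. Here is a quick sanity check (a sketch assuming the default momentum of 0.1 and that the running variance is updated with the unbiased batch variance):
momentum = bn.momentum  # 0.1 by default
expected_mean = (1 - momentum) * 0. + momentum * x.mean([0, 2, 3])
expected_var = (1 - momentum) * 1. + momentum * x.var([0, 2, 3])  # unbiased batch variance
print('expected running_mean {}, running_var {}'.format(expected_mean, expected_var))
# > should reproduce the running_mean and running_var printed above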
If the running stats had been used during training, the output tensor would not have been normalized, since the initial running stats are a zero mean and a unit variance (the output would have kept roughly the original mean of ~5 and std of ~10).
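For completeness, switching to eval mode makes the layer use the running stats instead, so the output is no longer normalized to a zero mean and unit std (a minimal sketch continuing the same example):
# apply bn in eval mode: running stats are used for normalization
bn.eval()
with torch.no_grad():
    output = bn(x)
print('mean {}, std {}'.format(output.mean([0, 2, 3]), output.std([0, 2, 3])))
# > mean and std will not be ~0 and ~1, since the running stats (~0.50, ~10.89) differ from the batch stats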