`F.batch_norm` returns different results in `train` and `eval` mode given the same setup

1. Set the seed
import torch
import torch.nn as nn
import torch.nn.functional as F

from pytorch_lightning.utilities.seed import seed_everything
seed_everything(666)

num=6
x = torch.rand(4,num)
print(x)
>>> tensor([[0.3119, 0.2701, 0.1118, 0.1012, 0.1877, 0.0181],
        [0.3317, 0.0846, 0.5732, 0.0079, 0.2520, 0.5518],
        [0.8785, 0.5281, 0.4961, 0.9791, 0.5817, 0.4875],
        [0.0650, 0.7506, 0.2634, 0.3684, 0.5035, 0.9089]])
2. Randomly initialize a BatchNorm layer
bn = nn.BatchNorm1d(num)

rand = lambda num: torch.rand(num)

weight = rand(num)
bias = rand(num)
mean = rand(num)
var = rand(num)

bn.weight.data = weight.data
bn.bias.data = bias.data
bn.running_mean.data = mean.data
bn.running_var.data = var.data
3. Compare the two forms
bn.eval()
y1 = bn(x)

>>> tensor([[ 0.1871, -0.7274, -0.2401,  0.2894,  0.8241,  0.0806],
        [ 0.2042, -1.5069,  0.6280,  0.2882,  0.8298,  0.6823],
        [ 0.6761,  0.3567,  0.4828,  0.3000,  0.8590,  0.6098],   
        [-0.0260,  1.2915,  0.0450,  0.2926,  0.8521,  1.0849]],
       grad_fn=<NativeBatchNormBackward>)

y2 = F.batch_norm(x, mean, var, weight, bias, eps=1e-5, momentum=0.1, training=False)
>>> tensor([[ 0.1871, -0.7274, -0.2401,  0.2894,  0.8241,  0.0806],
        [ 0.2042, -1.5069,  0.6280,  0.2882,  0.8298,  0.6823],
        [ 0.6761,  0.3567,  0.4828,  0.3000,  0.8590,  0.6098],
        [-0.0260,  1.2915,  0.0450,  0.2926,  0.8521,  1.0849]])

We can see that the two forms return the same results.
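As a quick sanity check (my own addition, assuming the snippets above were run in the same session), the two outputs can be compared numerically:

print(torch.allclose(y1, y2))  # should print True: both paths normalize with the stored running stats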

But if we try `F.batch_norm(..., training=True)`, we get totally different results:

y2 = F.batch_norm(x, mean, var, weight, bias, eps=1e-5, momentum=0.1, training=True)
>>> tensor([[ 0.1582, -0.3523, -0.2571,  0.2823,  0.8138, -0.5995],
        [ 0.1835, -0.9058,  0.7885,  0.2801,  0.8397,  0.7659],
        [ 0.8830,  0.4174,  0.6137,  0.3029,  0.9721,  0.6014],
        [-0.1577,  1.0812,  0.0864,  0.2886,  0.9407,  1.6796]])

My question is: what is the role of `training` in `F.batch_norm`?

I do get the same (expected) results when both approaches use the same mode, i.e. `bn.eval()` with `training=False`, or `bn.train()` with `training=True`.
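For example, a minimal sketch (reusing the variables defined above; `.clone()` avoids mutating the stored stats in place):

bn.train()
y1_train = bn(x)  # normalizes with the batch statistics of x (and updates bn's running stats)
y2_train = F.batch_norm(x, mean.clone(), var.clone(), weight, bias,
                        training=True, momentum=0.1, eps=1e-5)
print(torch.allclose(y1_train, y2_train))  # should print True: both normalize with batch stats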

During training the batch statistics are used to normalize the input (and the running stats are updated), while during eval the running stats are used to normalize the input.
The docs explain this as well and also give the formula for how the running stats are updated.
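As an illustration (a rough sketch, not taken from the docs; it assumes `x`, `mean`, `var`, `weight` and `bias` still hold the values from the setup above), both modes can be reproduced manually:

# Training mode: normalize with the batch statistics of x
batch_mean = x.mean(dim=0)
batch_var = x.var(dim=0, unbiased=False)   # the biased variance is used for normalization
y_train_manual = (x - batch_mean) / torch.sqrt(batch_var + 1e-5) * weight + bias

# Eval mode: normalize with the stored running statistics
y_eval_manual = (x - mean) / torch.sqrt(var + 1e-5) * weight + bias

# Running-stat update performed in training mode (the unbiased batch variance is used here)
momentum = 0.1
new_running_mean = (1 - momentum) * mean + momentum * batch_mean
new_running_var = (1 - momentum) * var + momentum * x.var(dim=0, unbiased=True)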