Hi, I'm trying to implement BatchNorm1d by hand, but my result is always slightly different from PyTorch's output.

```
import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
# BatchNorm1d:
# The mean and standard deviation are calculated per dimension over the mini-batch.
# By default, during training this layer also keeps running estimates of its computed mean and variance,
# which are then used for normalization during evaluation. The default momentum for the running estimates is 0.1;
# momentum=1 is used below so that the running estimates equal the statistics of the latest batch.
conv1 = nn.Conv1d(4, 16, 1)
bn1 = nn.BatchNorm1d(16, eps=1e-5, momentum=1)
x = torch.rand(32, 4, 1)
input = conv1(x)
print("Input to batch norm:", input.shape)
after_norm = bn1(input)
print("Output of bn1:", after_norm[0].squeeze())
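# Per-channel mean over the batch and length dimensions
Ex = input.mean(dim=[0, 2], keepdim=True)
# Despite the name, this is a standard deviation: the square root of torch.var, which is unbiased (N - 1) by default
Varx = torch.sqrt(input.var(dim=[0, 2], keepdim=True))
print(Ex.shape)
after_norm2 = ((input - Ex) / Varx + bn1.eps) * bn1.weight.unsqueeze(0).unsqueeze(-1) + bn1.bias.unsqueeze(0).unsqueeze(-1)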
print("Manually output:", after_norm2[0].squeeze())
# test running mean
print("*" * 80)
print("Test Running Mean:")
print(input.mean(dim=(0, 2)))
print(bn1.running_mean)
print("Test Running Variance:")
print(input.var(dim=(0, 2)))
print(bn1.running_var)
```

The output is:

```
Input to batch norm: torch.Size([32, 16, 1])
Output of bn1: tensor([ 5.1425e-02, -1.1029e+00, 1.0990e+00, -7.5205e-04, 9.9263e-01,
1.1077e-01, -2.0028e-01, -2.5355e-01, 3.5257e-01, -2.5200e-01,
2.7035e-01, -1.3284e+00, -1.4589e+00, 4.0900e-02, 5.6800e-03,
1.9910e-01], grad_fn=<SqueezeBackward0>)
torch.Size([1, 16, 1])
Manually output: tensor([ 5.0625e-02, -1.0859e+00, 1.0833e+00, -7.4031e-04, 9.7721e-01,
1.0905e-01, -1.9715e-01, -2.4961e-01, 3.4707e-01, -2.4808e-01,
2.6616e-01, -1.3077e+00, -1.4362e+00, 4.0275e-02, 5.5915e-03,
1.9602e-01], grad_fn=<SqueezeBackward0>)
********************************************************************************
Test Running Mean:
tensor([ 0.0606, 0.0907, -0.0440, -0.3544, 0.2975, 0.8878, -0.3703, -0.2941,
-0.4312, 0.7303, 0.1518, 0.4278, 0.3618, -0.2783, -0.3981, -0.1884],
grad_fn=<MeanBackward2>)
tensor([ 0.0606, 0.0907, -0.0440, -0.3544, 0.2975, 0.8878, -0.3703, -0.2941,
-0.4312, 0.7303, 0.1518, 0.4278, 0.3618, -0.2783, -0.3981, -0.1884])
Test Running Variance:
tensor([0.0282, 0.0136, 0.0035, 0.0318, 0.0247, 0.0329, 0.0380, 0.0204, 0.0463,
0.0267, 0.0224, 0.0332, 0.0319, 0.0167, 0.0294, 0.0222],
grad_fn=<VarBackward1>)
tensor([0.0282, 0.0136, 0.0035, 0.0318, 0.0247, 0.0329, 0.0380, 0.0204, 0.0463,
0.0267, 0.0224, 0.0332, 0.0319, 0.0167, 0.0294, 0.0222])
```
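
One possible explanation, offered as a sketch rather than a definitive answer: the two outputs above differ by a near-constant factor of about 1.016, which is close to sqrt(32/31). That is what you would expect if the layer normalizes with the biased variance (divide by N) while `input.var(...)` returns the unbiased one (divide by N - 1). The documented formula also puts `eps` inside the square root, added to the variance, rather than adding it to the normalized value. The snippet below checks both points on a random tensor; the tensor `x` is a stand-in for the conv output, and the names `var_biased`, `manual`, and `matches` are only for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

bn = nn.BatchNorm1d(16, eps=1e-5, momentum=1)
x = torch.rand(32, 16, 1)  # stand-in for the conv output in the question

expected = bn(x)  # training mode: normalizes with the batch statistics

# Per-channel statistics over the batch and length dimensions
mean = x.mean(dim=(0, 2), keepdim=True)
# Biased variance (divide by N) is used for the normalization itself
var_biased = x.var(dim=(0, 2), unbiased=False, keepdim=True)

# eps is added to the variance, inside the square root
x_hat = (x - mean) / torch.sqrt(var_biased + bn.eps)
manual = x_hat * bn.weight.view(1, -1, 1) + bn.bias.view(1, -1, 1)

matches = torch.allclose(expected, manual, atol=1e-5)

# The running variance, by contrast, is updated with the unbiased estimate,
# which is why it agrees with input.var(...) in the printout above
var_unbiased = x.var(dim=(0, 2), unbiased=True)
running_var_matches = torch.allclose(bn.running_var, var_unbiased, atol=1e-6)

print("Manual output matches bn:", matches)
print("running_var matches unbiased variance:", running_var_matches)
```

If both checks print `True`, switching the manual computation to `unbiased=False` and moving `eps` under the square root should close the gap in the original script as well.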