Here is my own implementation of the BatchNorm forward function:

```
def forward(self, x):
    self._check_input_dim(x)
    if self.training:
        N, C, H, W = x.size()
        x = x.transpose(0, 1).contiguous().view(C, -1)
        mu = x.mean(1, keepdim=True)
        sigma = x.var(1, keepdim=True)
        with torch.no_grad():
            self.running_mean = (1 - self.momentum) * self.running_mean + self.momentum * mu
            self.running_var = (1 - self.momentum) * self.running_var + self.momentum * sigma
        x = x - mu
        x = x / (sigma.sqrt() + self.eps)
        x = x * self.weight + self.bias
        x = x.view(C, N, H, W).transpose(0, 1)
        return x
    else:
        N, C, H, W = x.size()
        x = x.transpose(0, 1).contiguous().view(C, -1)
        x = (x - self.running_mean) / (self.running_var.sqrt() + self.eps)
        x = x * self.weight + self.bias
        x = x.view(C, N, H, W).transpose(0, 1)
        return x
```

However, when training on CIFAR-10 (I'm using ResNet50, trained from scratch), it always breaks down after a few epochs and the loss becomes NaN. After debugging, I found that the error comes from the affine operation `x = x * self.weight + self.bias`, which produces all negative values. After the ReLU layer, you get all 0s. Next, after a Conv layer, you get NaN.

Why does a Conv layer give NaN for an all-zero input?
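A small check that may help narrow this down: a convolution with finite weights maps an all-zero input to a finite output, so NaN coming out of a zero input suggests the layer's own parameters may already contain NaN (for example after an optimizer step that used NaN gradients). The sketch below only illustrates this with a standalone `nn.Conv2d`; the layer sizes and the helper name `nonfinite_params` are hypothetical and not taken from the network in question.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-in for one Conv layer of the network (sizes are arbitrary).
conv = nn.Conv2d(3, 4, kernel_size=3, bias=False)
zeros = torch.zeros(1, 3, 8, 8)

# Finite weights: every product is 0 * w, so the output stays finite.
out_clean = conv(zeros)

# Simulate a single weight that has already become NaN during training.
with torch.no_grad():
    conv.weight[0, 0, 0, 0] = float("nan")

# 0 * nan is nan in IEEE 754 arithmetic, so a zero input does not hide it.
out_corrupt = conv(zeros)


def nonfinite_params(module):
    """Return names of parameters containing NaN or inf (hypothetical helper)."""
    return [name for name, p in module.named_parameters()
            if not torch.isfinite(p).all()]


print(torch.isfinite(out_clean).all().item())
print(torch.isnan(out_corrupt).any().item())
print(nonfinite_params(conv))
```

Running a helper like `nonfinite_params` on the real model right after the first NaN loss would show whether the Conv weights were already corrupted by an earlier backward pass.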

How should this problem be solved? Should I add eps after the affine transformation?
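On the eps question, one thing that may be worth checking before touching the affine step is where eps sits in the normalization. The code above computes `sigma.sqrt() + self.eps`, whereas the usual batch norm formulation (as documented for `nn.BatchNorm2d`) adds eps to the variance before taking the square root. The derivative of `sqrt` at exactly 0 is infinite, so a channel with zero variance (for instance all zeros after ReLU) can produce NaN gradients even though the forward output looks fine. The following is a minimal sketch under that assumption; the function names are hypothetical, and this is not confirmed to be the cause of the NaN in this particular run.

```python
import torch


def normalize_eps_outside(x, eps=1e-5):
    # Mirrors the post: eps is added after the square root.
    mu = x.mean(1, keepdim=True)
    var = x.var(1, keepdim=True)
    return (x - mu) / (var.sqrt() + eps)


def normalize_eps_inside(x, eps=1e-5):
    # Same statistics, but eps is added to the variance before the square root.
    mu = x.mean(1, keepdim=True)
    var = x.var(1, keepdim=True)
    return (x - mu) / torch.sqrt(var + eps)


def grad_is_finite(fn):
    # Two "channels" whose values are all identical, so each variance is exactly 0.
    x = torch.zeros(2, 8, requires_grad=True)
    fn(x).sum().backward()
    return bool(torch.isfinite(x.grad).all())


outside_ok = grad_is_finite(normalize_eps_outside)
inside_ok = grad_is_finite(normalize_eps_inside)
print("eps outside sqrt, finite grad:", outside_ok)
print("eps inside sqrt, finite grad:", inside_ok)
```

If the zero-variance case is what happens here, moving eps inside the square root would be a smaller change than adding eps after the affine transformation, since the affine step itself involves no division or square root.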