My self-implemented BatchNorm + ReLU gives NaN

Here is my own implementation of the BatchNorm forward function:
def forward(self, x):
    self._check_input_dim(x)

    if self.training:
        N, C, H, W = x.size()
        # Flatten to (C, N*H*W) so that statistics are computed per channel
        x = x.transpose(0, 1).contiguous().view(C, -1)
        mu = x.mean(1, keepdim=True)
        sigma = x.var(1, keepdim=True)

        # Update running statistics without tracking gradients
        with torch.no_grad():
            self.running_mean = (1 - self.momentum) * self.running_mean + self.momentum * mu
            self.running_var = (1 - self.momentum) * self.running_var + self.momentum * sigma

        # Normalize with the batch statistics, then apply the learnable affine transform
        x = x - mu
        x = x / (sigma.sqrt() + self.eps)
        x = x * self.weight + self.bias
        x = x.view(C, N, H, W).transpose(0, 1)
        return x

    else:
        N, C, H, W = x.size()
        x = x.transpose(0, 1).contiguous().view(C, -1)
        # Normalize with the running statistics in eval mode
        x = (x - self.running_mean) / (self.running_var.sqrt() + self.eps)
        x = x * self.weight + self.bias
        x = x.view(C, N, H, W).transpose(0, 1)
        return x
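
As a sanity check, one way to validate a custom BatchNorm forward is to compare it against torch.nn.functional.batch_norm on a random batch. The snippet below is only a sketch (random (N, C, H, W) input, default eps and momentum), not my actual training code:

import torch
import torch.nn.functional as F

# Illustrative cross-check: compare a custom batch-norm forward against
# PyTorch's reference implementation on a random (N, C, H, W) batch.
x = torch.randn(8, 16, 32, 32)
weight, bias = torch.ones(16), torch.zeros(16)
running_mean, running_var = torch.zeros(16), torch.ones(16)

ref = F.batch_norm(x, running_mean, running_var, weight=weight, bias=bias,
                   training=True, momentum=0.1, eps=1e-5)

# out = my_bn(x)   # custom module under test (hypothetical name)
# print(torch.allclose(out, ref, atol=1e-5))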

However, during training (on CIFAR-10) it always breaks down after running for some epochs (I'm using ResNet50, trained from scratch) and gives NaN values for the loss. After debugging, I found the error comes from the affine operation: x = x * self.weight + self.bias
which produces all negative values. After the ReLU layer, you get all 0s. Next, after a Conv layer, you get NaN.
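
A quick way to isolate that last step is to feed an all-zero tensor through a standalone Conv layer: it only produces NaN once the layer's own parameters already contain NaN or Inf, since 0 * NaN = NaN in floating-point arithmetic. A purely illustrative check (made-up layer sizes, not my actual network):

import torch
import torch.nn as nn

# Illustrative only: a conv over an all-zero input stays finite
# unless its own parameters already contain NaN/Inf.
conv = nn.Conv2d(3, 3, kernel_size=3, padding=1)
zeros = torch.zeros(1, 3, 8, 8)
print(torch.isnan(conv(zeros)).any())      # tensor(False)

with torch.no_grad():
    conv.weight[0, 0, 0, 0] = float('nan')   # simulate a corrupted weight
print(torch.isnan(conv(zeros)).any())      # tensor(True): NaN propagates even for zero input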

Why does a Conv layer give NaN for an all-zero input?
How should this problem be solved? Should I add eps after the affine transformation?

Problem fixed by changing x = x / (sigma.sqrt() + self.eps) to x = x / (sigma + self.eps).sqrt().
This is because 1/|x| has no derivative at x = 0 (and sigma.sqrt() behaves like |x|, since sigma is a variance, roughly x^2), while 1/sqrt(x^2 + eps) is differentiable at x = 0, so the gradient no longer blows up when a channel's variance gets close to zero.
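
To see the difference numerically, here is a minimal sketch (assuming eps = 1e-5, the nn.BatchNorm2d default) comparing the gradient of the two eps placements when a channel's variance is exactly zero:

import torch

eps = 1e-5  # nn.BatchNorm2d default

# eps added *after* the square root (original version): sqrt's backward
# divides by 2*sqrt(sigma), which is zero at sigma = 0, so the gradient is infinite.
sigma = torch.tensor(0.0, requires_grad=True)
(1.0 / (sigma.sqrt() + eps)).backward()
print(sigma.grad)   # tensor(-inf) -> blows up / turns into NaN downstream

# eps added *before* the square root (the fix): the gradient stays finite.
sigma = torch.tensor(0.0, requires_grad=True)
(1.0 / (sigma + eps).sqrt()).backward()
print(sigma.grad)   # about -1.58e7: large, but finite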
