Batch norm instability

I am facing the same problem, and my guess at the cause is this: for some batches and some layers (in my case, depthwise layers), the inputs to a ReLU are all negative, so its outputs are all zero. Those all-zero batches make the BN running stats gradually decay toward very small values, which in turn makes the weights very large. Large weights typically cause overfitting, which is why the training loss is small while the validation loss is huge.

I checked: with running_var at 5.6052e-45, the BN std is effectively sqrt(eps) = sqrt(1e-5) ≈ 3.162e-3 (bn.eps is 1e-5), so the ratio between the weight and the BN std is in the thousands.

I wonder what the proper way to handle the case where all inputs in a batch are zero would be. I tried setting bn.eps to 0.1; the weights are no longer huge, but there is still strong overfitting (probably because the BN no longer works properly, since the running stats are not collected correctly). BTW, I did not apply weight decay to the depthwise conv weights.
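To illustrate the mechanism, here is a minimal torch-free sketch (with assumed placeholder numbers: weight = 1.0, 1000 steps) of BatchNorm's exponential-moving-average update for running_var. If the batch variance is zero step after step, running_var decays geometrically toward zero, and the effective per-channel scale weight / sqrt(running_var + eps) saturates at weight / sqrt(eps):

```python
import math

momentum = 0.1    # PyTorch BatchNorm default
eps = 1e-5        # PyTorch BatchNorm default
weight = 1.0      # hypothetical learned gamma for one channel

running_var = 1.0  # PyTorch's initialization
for step in range(1000):
    batch_var = 0.0  # all inputs zero after ReLU -> zero batch variance
    # EMA update used by BatchNorm in training mode
    running_var = (1 - momentum) * running_var + momentum * batch_var

# effective scale applied at eval time
scale = weight / math.sqrt(running_var + eps)
print(running_var)  # ~ 0.9**1000, effectively zero next to eps
print(scale)        # approaches weight / sqrt(eps) ≈ 316
```

This is why raising bn.eps caps the scale but does not fix the underlying issue: the running stats themselves no longer reflect the data distribution.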