BatchNorm and zero/NaN input: different behaviour of eval() and train() modes even after the model is broken

Most people who have struggled with NaNs at the output know that feeding NaN or all-zero input to a model containing layers such as BatchNorm can break the model. During training, if you never run a backward pass, the model's parameters are not corrupted. However, once a forward pass has been performed with these undesired inputs, eval mode outputs NaN while train mode does not. Is there an explanation for this?

The running stats of the BatchNorm layers (running_mean and running_var) are updated during every train-mode forward pass, so they get poisoned by the invalid values. In train mode the normalization uses the statistics of the current batch, so valid inputs still produce valid outputs, but in eval mode the poisoned running stats are used and every forward pass yields invalid outputs.
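A minimal sketch reproducing this (assumes a recent PyTorch; the layer size and batch shapes are just illustrative):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
bn = nn.BatchNorm1d(4)

# One forward pass with NaN input in train mode poisons running_mean/running_var.
bn.train()
_ = bn(torch.full((8, 4), float("nan")))
print(bn.running_mean)            # tensor([nan, nan, nan, nan])

# Train mode still works on valid inputs because batch statistics are used.
valid = torch.randn(8, 4)
print(bn(valid).isnan().any())    # tensor(False)

# Eval mode uses the poisoned running stats, so the output becomes NaN.
bn.eval()
print(bn(valid).isnan().any())    # tensor(True)
```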


Thanks for the reply. This was what I guessed. Apart from not feeding zero/NaN inputs, are there any guard rails I could use, such as automatically ignoring such samples? Or a mechanism to avoid updating running_mean and running_var?

I’m not aware of any other approaches besides making sure these invalid values are filtered out (e.g. via torch.isfinite(input)).
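For example, a rough sketch of such a filter (not a built-in guard rail; `filter_finite`, `model`, and `batch` are hypothetical names):

```python
import torch

def filter_finite(batch: torch.Tensor) -> torch.Tensor:
    # Keep only samples whose values are all finite (no NaN/Inf),
    # so BatchNorm running stats are never updated with invalid values.
    keep = torch.isfinite(batch).flatten(start_dim=1).all(dim=1)
    return batch[keep]

batch = torch.randn(8, 4)
batch[2] = float("nan")          # simulate a corrupted sample

clean = filter_finite(batch)
print(clean.shape)               # torch.Size([7, 4])
# output = model(clean)          # safe to feed to a model containing BatchNorm
```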
