What can modify BatchNorm1d._parameters['weight']?

While playing around with TabNet and optuna, we stumbled upon some crashes. The
crashes happen with both the CUDA and CPU back-ends, with both n_jobs=-1 and n_jobs=1 in optuna, and both in Google Colab and on a PC (Windows 11, 64-bit Intel processor), but there doesn’t seem to be any pattern to them. Sometimes it crashes right away; other times many batches succeed until one fails. A fit operation in a fresh process typically doesn’t crash on the same subset of data with the same hyper-parameters, but if we keep optimising hyper-parameters, it eventually fails.

When using the CPU back-end, the crash can be traced to the output of
BatchNorm1d containing a few columns of NaNs: initial_bn._parameters['weight']
consists mostly of ones, but also contains three NaNs for some reason. Here’s
what vars(self.initial_bn) looks like in the Python debugger:

ipdb> vars(self.initial_bn)
{'training': False, '_parameters': OrderedDict([('weight', Parameter containing:
tensor([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
        1., nan, 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., # <-- here
        1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., nan, nan], # <-- and here
       requires_grad=True)), ('bias', Parameter containing:
tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0.], requires_grad=True))]), '_buffers': OrderedDict([('running_mean', tensor([7.9071e-01, 7.1917e-01, 1.5310e+00, 3.7201e-01, 8.5547e-01, 1.0965e+00,
        1.4760e-01, 3.3285e-01, 3.1628e-01, 1.1597e-01, 7.2294e-02, 1.1446e-01,
        8.8108e-02, 2.4851e-02, 1.4308e-02, 1.3066e+00, 5.6555e-01, 1.8375e-01,
        7.3461e-01, 3.0939e+04, 7.9071e-01, 6.7323e-01, 1.3804e+00, 3.5469e-01,
        8.9087e-01, 1.1529e+00, 1.8299e-01, 3.1327e-01, 3.6297e-01, 1.2049e-01,
        7.3800e-02, 1.1672e-01, 1.0543e-01, 2.7863e-02, 2.1839e-02, 2.2592e-03,
        8.2836e-03, 2.2592e-03, 7.5306e-04, 7.5306e-04, 7.5306e-04, 7.5306e-04,
        0.0000e+00, 0.0000e+00, 1.3246e+00, 6.9206e-01, 2.6734e-01, 8.2272e-01,
        1.2050e+05, 7.1614e+03])), ('running_var', tensor([6.3260e-01, 7.9830e-01, 2.6399e+00, 9.1715e-01, 3.8547e+00, 7.5853e+00,
        7.6188e-01, 1.8178e+00, 3.4969e+00, 2.1489e+00, 6.6469e-01, 1.0665e+00,
        1.3749e+00, 5.8826e-01, 5.7863e-01, 1.2853e+00, 1.0577e+00, 6.6113e-01,
        1.2533e+00, 1.6644e+09, 6.3260e-01, 8.7741e-01, 3.1752e+00, 9.2798e-01,
        3.7487e+00, 7.2443e+00, 7.6660e-01, 1.8189e+00, 3.4726e+00, 2.1510e+00,
        6.6571e-01, 1.0555e+00, 1.3847e+00, 5.9092e-01, 5.7349e-01, 5.5493e-01,
        5.6083e-01, 5.5493e-01, 5.5344e-01, 5.5344e-01, 5.5344e-01, 5.5344e-01,
        5.5268e-01, 5.5268e-01, 1.2935e+00, 1.3426e+00, 6.6043e-01, 1.4785e+00,
        7.4687e+12, 3.4789e+07])), ('num_batches_tracked', tensor(59))]), '_non_persistent_buffers_set': set(), '_backward_hooks': OrderedDict(), '_is_full_backward_hook': None, '_forward_hooks': OrderedDict(), '_forward_pre_hooks': OrderedDict(), '_state_dict_hooks': OrderedDict(), '_load_state_dict_pre_hooks': OrderedDict(), '_modules': OrderedDict(), 'num_features': 50, 'eps': 1e-05, 'momentum': 0.01, 'affine': True, 'track_running_stats': True}

How could this happen? What can I do to trace the source of NaNs?
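(For anyone trying to reproduce this, here is a minimal sketch of the check that scans a module for non-finite parameters and buffers; the helper name is just a placeholder, and `self` below stands for whatever module owns initial_bn:)

import torch

def report_non_finite(module, prefix="model"):
    # Walk every parameter and buffer and report entries that are NaN or Inf.
    for kind, items in (("parameter", module.named_parameters()),
                        ("buffer", module.named_buffers())):
        for name, tensor in items:
            bad = ~torch.isfinite(tensor)
            if bad.any():
                print(f"{prefix}.{name} ({kind}): "
                      f"{int(bad.sum())} non-finite values out of {tensor.numel()}")

# e.g. from the debugger:
# report_non_finite(self)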

(When using the CUDA back-end, I can’t see anything useful in the debugger, not even the contents of the BatchNorm1d I suspect, because any access to a tensor on the GPU results in an error. The Colab GPU assertion log suggests the failure is similar: an all-NaN matrix is sorted and converted to indices of maxima (which are all 0), then 1 is subtracted from it for some reason, and the resulting -1 is used as an index into an array, failing the assertion. After that I have to restart the Python process, otherwise every GPU-related operation keeps failing with CUDA error: device-side assert triggered.)
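(One trick that should help with the opaque CUDA failure, assuming the usual PyTorch behaviour that device-side asserts are reported asynchronously: forcing synchronous kernel launches makes the Python stack trace point at the operation that actually failed. A sketch:)

import os

# Must be set before the first CUDA operation in the process (ideally before
# importing anything that initialises the GPU). With synchronous launches,
# the Python stack trace for "device-side assert triggered" points at the
# operation that launched the failing kernel instead of a later, unrelated one.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"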

What can result in ._parameters['weight'] of a BatchNorm1d changing? As
far as I was able to trace, it is never purposefully changed in Python code, and
it always seems to be passed as a const reference to C++ code. I’m not desperate enough to run my neural networks under Valgrind, AddressSanitizer, or gdb with watchpoints, yet.
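In ordinary training, the only thing that should write to that parameter is the optimizer step, driven by the parameter’s gradient, so one way to catch the culprit in the act is a gradient hook on the parameter itself. A sketch; how you reach the BatchNorm1d instance depends on the model and is left as a placeholder:

import torch

def install_nan_grad_hook(param, name):
    # Raise as soon as a non-finite gradient reaches this parameter,
    # i.e. before the optimizer can turn it into NaN weights.
    def hook(grad):
        if not torch.isfinite(grad).all():
            raise RuntimeError(f"non-finite gradient for {name}")
        return grad
    return param.register_hook(hook)

# bn = ...  # however you reach the suspect BatchNorm1d in the model
# install_nan_grad_hook(bn.weight, "initial_bn.weight")
# install_nan_grad_hook(bn.bias, "initial_bn.bias")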

Here are the versions we’re using:

                Google Colab    PC
PyTorch         1.10.0+cu111    1.8.1
optuna          2.10.0          2.10.0
pytorch-tabnet  3.1.1           3.1.1

Update: I’ve been told that bias and weight are learnable parameters of BatchNorm1d layers, which means I might be getting gradient explosion. But this is batch 0; how can I get some visibility into why the parameter update happens the way it does?

Yes, the weight and bias are the trainable affine parameters.
Check the values directly after initialization and make sure they are valid. Afterwards, check the gradients of these parameters and see if they contain any Infs or NaNs.
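A minimal sketch of both checks, written against a generic model rather than pytorch-tabnet specifically (its fit loop is internal, so in practice this would have to run from a hook or a patched training step):

import torch

def check_params_and_grads(model):
    # Call once right after the model is built (grads are still None then),
    # and again after each loss.backward() but before optimizer.step().
    for name, p in model.named_parameters():
        if not torch.isfinite(p).all():
            print(f"non-finite VALUES in {name}")
        if p.grad is not None and not torch.isfinite(p.grad).all():
            print(f"non-finite GRADIENT for {name}")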

Thanks! I’ve run the network with torch.autograd.set_detect_anomaly(True) and got a crash much earlier, during a backward pass; I also found that some infs had made their way into the gradient before the process crashed.
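(For reference, enabling it is a single global toggle before calling fit; it slows training down noticeably, so it is only worth keeping on while debugging:)

import torch

# Record, for every gradient, which forward operation produced it, and raise
# with that operation's stack trace as soon as a NaN shows up in backward.
torch.autograd.set_detect_anomaly(True)

# ...then run the usual fit, e.g. clf.fit(X_train, y_train, ...)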

I guess that answers my original question. Now to find out how the gradients got so large…
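(One low-effort way to get that visibility, sketched against a generic model rather than TabNet specifically: log per-parameter gradient norms each batch and watch which layer blows up first.)

def log_grad_norms(model, step, top_k=5):
    # Call after loss.backward(): print the largest per-parameter gradient
    # norms so the layer whose gradients explode first is easy to spot.
    norms = {name: p.grad.norm().item()
             for name, p in model.named_parameters() if p.grad is not None}
    worst = sorted(norms.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    print(f"step {step}: largest grad norms: {worst}")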