How to debug NaNs appearing after hours of runtime?

I have code that seems to run really well right now, but it has one problem: sometimes during training it returns NaN.

Here is a snippet from my latest logfile, showing the mean loss of each epoch:

 0  Loss(train): 5.21e-01  Loss(val): 3.66e-01  
 1  Loss(train): 2.61e-01  Loss(val): 2.09e-01  
 2  Loss(train): 1.91e-01  Loss(val): 1.85e-01  
 3  Loss(train): 1.56e-01  Loss(val): 1.54e-01  
 4  Loss(train): 1.33e-01  Loss(val): 1.15e-01  
 5  Loss(train): 1.19e-01  Loss(val): 1.13e-01  
 6  Loss(train): 1.07e-01  Loss(val): 1.11e-01  
 7  Loss(train): 9.75e-02  Loss(val): 9.25e-02  
 8  Loss(train): 9.14e-02  Loss(val): 1.00e-01  
 9  Loss(train): 8.57e-02  Loss(val): 8.18e-02  
10  Loss(train): 7.88e-02  Loss(val): 7.34e-02 
11  Loss(train): 7.45e-02  Loss(val): 7.97e-02 
12  Loss(train): 7.18e-02  Loss(val): 6.37e-02 
13  Loss(train): 6.75e-02  Loss(val): 6.64e-02 
14  Loss(train): 6.79e-02  Loss(val): 6.55e-02 
15  Loss(train): 2.12e-01  Loss(val): 6.11e-01 
16  Loss(train): 3.95e-01  Loss(val): 7.29e-02 
17  Loss(train): 6.69e-02  Loss(val): 5.58e-02 
18  Loss(train): 5.95e-02  Loss(val): 5.19e-02 
19  Loss(train): nan  Loss(val): nan  
20  Loss(train): nan  Loss(val): nan  
21  Loss(train): nan  Loss(val): nan  
22  Loss(train): nan  Loss(val): nan

As far as I can tell, the network is training decently and seems to be on the right track, and then suddenly the loss just goes to NaN. It seems to be preventable by lowering the learning rate, but I'm hesitant to do this since the learning rate otherwise seems well chosen for the problem.

So far I have tried to prevent it with gradient clipping:

    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0e-2, norm_type=2.0)
    optimizer.step()

but that doesn't seem to help. I have also tried using:

torch.autograd.set_detect_anomaly(True)

but no problems show up in the first couple of iterations (I haven't let it run until it hits the problem yet, since anomaly detection slows down the code quite a bit, and in the past any errors have been caught in the initial iterations with detect_anomaly).

My next attempt will be to run the code with detect_anomaly enabled until it errors, but that will take quite a while. Are there any other standard approaches I should try in order to figure out why my code suddenly goes to NaN in the middle of training?

You could create checkpoints (i.e. store the state_dicts of the model, optimizer, etc.) and try to narrow down the iteration causing the issue. Assuming restoring the training state works, you could then start the training from a few iterations before the invalid values are created and enable anomaly detection, as sketched below.
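
A rough sketch of what I mean, with placeholder names (adapt the paths and keys to your setup):

    import torch

    def save_checkpoint(path, model, optimizer, epoch, iteration):
        # Store everything needed to restart training close to the failure.
        torch.save({
            'model': model.state_dict(),
            'optimizer': optimizer.state_dict(),
            'epoch': epoch,
            'iteration': iteration,
        }, path)

    def resume_for_debugging(path, model, optimizer):
        # Restore a checkpoint taken a few iterations before the NaNs appear
        # and rerun that stretch with anomaly detection enabled.
        checkpoint = torch.load(path)
        model.load_state_dict(checkpoint['model'])
        optimizer.load_state_dict(checkpoint['optimizer'])
        torch.autograd.set_detect_anomaly(True)
        return checkpoint['epoch'], checkpoint['iteration']

Since anomaly detection is expensive, you would only pay for it on the short stretch right before the failure instead of the whole run.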


I just came back to update this post and saw this reply, which is incidentally very close to what I have been doing.

My plan was to build NaN protection into the training by saving the model state_dict after each epoch; if NaNs were detected in an epoch, I would reload the previous epoch's model, lower the learning rate a bit, and continue training.
However, that doesn't seem to work, because whenever I resume training with a model loaded like this, the model parameters all go to NaN after optimizer.step().

I don't see how this can happen, since the loss itself looks good. I checked all the model weights before loss.backward() and they all look good and reasonable, and I checked all the gradients of the weights after loss.backward() but before optimizer.step(), and they also all look good and reasonable, yet despite this everything goes to NaN.

When I reload the model I do not reload the optimizer (I'm using Adam); maybe that is the reason this doesn't work, though I can't really see why it would be a problem.

Here is the check I run on the model, just after loss.backward() and just before optimizer.step():

    # Flatten all parameters (assuming a single param group) into one 1D
    # tensor and assert that none of them are NaN/Inf.
    weights = optimizer.param_groups[0]['params']
    weights_flat = [torch.flatten(weight) for weight in weights]
    weights_1d = torch.cat(weights_flat)
    assert not torch.isnan(weights_1d).any()
    assert not torch.isinf(weights_1d).any()
    print(f"{weights_1d.max()}, {weights_1d.min()}")

    # Same check for the gradients (valid after loss.backward()).
    grad_flat = [torch.flatten(weight.grad) for weight in weights]
    grad_1d = torch.cat(grad_flat)
    assert not torch.isnan(grad_1d).any()
    assert not torch.isinf(grad_1d).any()
    print(f"{grad_1d.max()}, {grad_1d.min()}")

Based on your suggestion I tried loading the model again and enabling anomaly detection, but strangely enough it didn't seem to do anything: once again the model was loaded, data was passed through it, and a loss was computed and backpropagated.

For the model weights and gradients, the max/min values were:

8.083415985107422, -9.480425834655762
0.3138473927974701, -0.9406746625900269

Then I let the optimizer take a step, and afterwards all the weights were NaN again, yet the anomaly detection didn't seem to catch anything, strangely enough.

I will also try saving and loading the optimizer in my current approach, but any other suggestions for how to debug this would be greatly appreciated.

I tried again, saving and loading the optimizer as well. While it can sometimes get through an epoch that previously failed, it is not doing well, and once it hits NaN it generally keeps returning NaN from then on.

This time I managed to capture the error with detect_anomaly, but unfortunately it doesn't give me much to go on:


[W ..\torch\csrc\autograd\python_anomaly_mode.cpp:104] Warning: Error detected in NormBackward1. Traceback of forward call that caused the error:
  File "C:\Program Files\JetBrains\PyCharm Community Edition 2021.1.2\plugins\python-ce\helpers\pydev\pydevd.py", line 2173, in <module>
    main()
  File "C:\Program Files\JetBrains\PyCharm Community Edition 2021.1.2\plugins\python-ce\helpers\pydev\pydevd.py", line 2164, in main
    globals = debugger.run(setup['file'], None, None, is_module)
  File "C:\Program Files\JetBrains\PyCharm Community Edition 2021.1.2\plugins\python-ce\helpers\pydev\pydevd.py", line 1476, in run
    return self._exec(is_module, entry_point_fn, module_name, file, globals, locals)
  File "C:\Program Files\JetBrains\PyCharm Community Edition 2021.1.2\plugins\python-ce\helpers\pydev\pydevd.py", line 1483, in _exec
    pydev_imports.execfile(file, globals, locals)  # execute the script
  File "C:\Program Files\JetBrains\PyCharm Community Edition 2021.1.2\plugins\python-ce\helpers\pydev\_pydev_imps\_pydev_execfile.py", line 18, in execfile
    exec(compile(contents+"\n", file, 'exec'), glob, loc)
  File "C:/Users/Tue/PycharmProjects/ConstrainedMD/src/run_MD_e3_batch.py", line 99, in <module>
    results = main(c)
  File "C:\Users\Tue\PycharmProjects\ConstrainedMD\src\main.py", line 195, in main
    aloss_t, alossr_t, alossv_t, alossD_t, alossDr_t, alossDv_t, ap_t, MAEr_t, MAEv_t = run_network_e3(model, dataloader_train, train=True, max_samples=1e6, optimizer=optimizer, loss_fnc=c['loss'], batch_size=c['batch_size'], max_radius=cn['max_radius'], debug=c['debug'], log=LOG)
  File "C:\Users\Tue\PycharmProjects\ConstrainedMD\src\utils.py", line 195, in run_network_e3
    loss_v = torch.sum(torch.norm(Vpred - Vout_vec, p=2, dim=1)) / nb
  File "C:\Users\Tue\PycharmProjects\ConstrainedMD\venv\PycharmProjects\lib\site-packages\torch\functional.py", line 1420, in norm
    return _VF.norm(input, p, _dim, keepdim=keepdim)  # type: ignore
 (function _print_stack)
Traceback (most recent call last):
  File "C:\Program Files\JetBrains\PyCharm Community Edition 2021.1.2\plugins\python-ce\helpers\pydev\pydevd.py", line 1483, in _exec
    pydev_imports.execfile(file, globals, locals)  # execute the script
  File "C:\Program Files\JetBrains\PyCharm Community Edition 2021.1.2\plugins\python-ce\helpers\pydev\_pydev_imps\_pydev_execfile.py", line 18, in execfile
    exec(compile(contents+"\n", file, 'exec'), glob, loc)
  File "C:/Users/Tue/PycharmProjects/ConstrainedMD/src/run_MD_e3_batch.py", line 99, in <module>
    results = main(c)
  File "C:\Users\Tue\PycharmProjects\ConstrainedMD\src\main.py", line 195, in main
    aloss_t, alossr_t, alossv_t, alossD_t, alossDr_t, alossDv_t, ap_t, MAEr_t, MAEv_t = run_network_e3(model, dataloader_train, train=True, max_samples=1e6, optimizer=optimizer, loss_fnc=c['loss'], batch_size=c['batch_size'], max_radius=cn['max_radius'], debug=c['debug'], log=LOG)
  File "C:\Users\Tue\PycharmProjects\ConstrainedMD\src\utils.py", line 237, in run_network_e3
    loss.backward()
  File "C:\Users\Tue\PycharmProjects\ConstrainedMD\venv\PycharmProjects\lib\site-packages\torch\tensor.py", line 245, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "C:\Users\Tue\PycharmProjects\ConstrainedMD\venv\PycharmProjects\lib\site-packages\torch\autograd\__init__.py", line 145, in backward
    Variable._execution_engine.run_backward(
RuntimeError: Function 'NormBackward1' returned nan values in its 0th output.

I have been trying to figure out what is meant by its 0th output, but haven’t found much info anywhere on detect_anomaly.
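
From what I can tell, the "0th output" just refers to the gradient that NormBackward1 produces for the first input of the norm call. As a side note, the gradient of the 2-norm is x / ||x||, so a toy example where the vector being normed is exactly zero reproduces the same kind of error; I'm not claiming that is what happens in my code, but it shows what the message means:

    import torch

    # Hypothetical minimal example: the gradient of the 2-norm is x / ||x||,
    # which is 0/0 = NaN when a row is exactly zero.
    x = torch.zeros(2, 3, requires_grad=True)
    with torch.autograd.detect_anomaly():
        loss = torch.norm(x, p=2, dim=1).sum()
        loss.backward()  # RuntimeError: Function 'NormBackward...' returned nan values in its 0th output.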

I figured out my problem.

I have a trainable semi-unitary matrix that I create iteratively, and it seems like the iterative process becomes unstable if training pushes that matrix too far away from its initial condition. The result is that the model suddenly returns NaNs even though all weights in the model appear reasonable.
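
In case it helps anyone else, a cheap way to watch for this kind of instability is to log how far the matrix drifts from semi-unitarity during training. This is only an illustration (the attribute name and threshold are made up), not my actual construction:

    import torch

    def semi_unitarity_error(W):
        # For a semi-unitary W of shape (m, n) with m >= n, W^T W should stay
        # close to the identity; the norm of the residual measures the drift.
        n = W.shape[1]
        eye = torch.eye(n, dtype=W.dtype, device=W.device)
        return torch.norm(W.transpose(0, 1) @ W - eye).item()

    # Hypothetical usage inside the training loop:
    # if semi_unitarity_error(model.projection_matrix) > 1e-3:
    #     print("semi-unitary matrix is drifting; re-orthogonalize or lower the lr")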


Thanks for the update and the great debugging!
Just a follow-up question, since I've only just seen the updates:

How does this fit with your previous findings, i.e. that neither the model output, nor the parameters, nor the gradients had invalid values, yet optimizer.step() caused the parameters to become NaNs?
Before I saw the other posts I was trying to reason about this issue, but wasn’t sure how this could be happening.

Sorry I didn't get back to you on this sooner.

> How does this fit with your previous findings, i.e. that neither the model output, nor the parameters, nor the gradients had invalid values, yet `optimizer.step()` caused the parameters to become NaNs?
> Before I saw the other posts I was trying to reason about this issue, but wasn't sure how this could be happening.

You are right, that part still doesn't make sense. I see two possible scenarios for what could have happened: 1) I somehow made a mistake when checking the regular output for NaNs, or 2) my check was to detect an error and reload the model from the previous epoch, but I didn't reload the optimizer (Adam), which I believe keeps some kind of momentum/gradient history (I'm not too sure on the details of how Adam works), so maybe that history in the optimizer was still NaN, which corrupted the optimizer step?
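
To test that second scenario, Adam's running averages can be inspected directly; a small helper along these lines (my own naming) would flag it:

    import torch

    def adam_state_is_finite(optimizer):
        # Adam stores per-parameter running averages of the gradient ('exp_avg')
        # and of the squared gradient ('exp_avg_sq'); if these contain NaN/Inf,
        # the next optimizer.step() can corrupt the parameters even when the
        # current gradients look fine.
        for group in optimizer.param_groups:
            for p in group['params']:
                state = optimizer.state.get(p, {})
                for key in ('exp_avg', 'exp_avg_sq'):
                    if key in state and not torch.isfinite(state[key]).all():
                        return False
        return True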

(I have since fixed the issue, and have been running the code for 7 days in a bunch of different tests and they are all super stable now, so it works great :))