Loss becomes NaN when introducing regularization

I have a model that trains well without any regularization. However, when I add L2 regularization (via the weight_decay parameter of the Adam optimizer), the loss becomes NaN after some iterations. I have tried several values of weight_decay (0.1, 0.01, and 1e-5) and also tried lowering the learning rate, but neither seems to help.
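
For reference, the optimizer is set up roughly like this (a minimal sketch; the model and the exact lr here are just placeholders for my actual, much larger setup):

```python
import torch

# Placeholder model; the real network is much larger (and highly overparameterized).
model = torch.nn.Linear(128, 10)

# L2 regularization via Adam's weight_decay. I tried 0.1, 0.01, and 1e-5,
# and also lowering lr, but the loss still becomes NaN eventually.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-2)
```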

The loss decreases nicely up to a certain point and then suddenly becomes NaN. Lowering the learning rate only slows convergence; the NaN still appears, just after more iterations. The NaN also seems to show up at roughly the same loss value each time.
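
In case it matters, this is roughly how I notice where the loss blows up (a self-contained toy version with dummy data; my real model and dataset are much larger):

```python
import torch

# Toy setup just to show the check; the real training loop is the same idea.
torch.manual_seed(0)
model = torch.nn.Linear(128, 10)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-2)
criterion = torch.nn.CrossEntropyLoss()
x, y = torch.randn(256, 128), torch.randint(0, 10, (256,))

last_finite = None
for step in range(2000):
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    if not torch.isfinite(loss):
        # This is where I see the blow-up, always near the same loss value.
        print(f"loss became non-finite at step {step}; last finite loss: {last_finite}")
        break
    last_finite = loss.item()
    loss.backward()
    optimizer.step()
```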

Does anyone have a suggestion as to why this happens, or what I could try in order to prevent it?

Note that the model is highly overparameterized, in case that makes a difference.