NaN in Loss Function after 13 epochs

I’m training an EfficientNetV2_B3 model on my own data. I have tried several runs, but after 13 epochs the loss becomes NaN. Up to the 13th epoch, the model trains and validates fine.

Dataset: Custom
Model: EfficientNetV2_B3
Loss Function: nn.CrossEntropyLoss
Batch size: 32
Training Mode: Mixed Precision
LR Scheduler: ReduceLROnPlateau - Default settings
Optimizer: Adam
Initial Learning rate: 0.01
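
For reference, here's a simplified sketch of the training step with this setup (the model and data loader below are just stand-ins, not my actual EfficientNetV2_B3 pipeline):

```python
import torch
import torch.nn as nn
from torch.cuda.amp import autocast, GradScaler

# Stand-ins for the real EfficientNetV2_B3 and custom dataset (placeholders)
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, 10)).cuda()
train_loader = [(torch.randn(32, 3, 224, 224).cuda(),
                 torch.randint(0, 10, (32,)).cuda())]

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer)  # default settings
scaler = GradScaler()

for images, targets in train_loader:
    optimizer.zero_grad()
    with autocast():                       # mixed-precision forward pass
        outputs = model(images)
        loss = criterion(outputs, targets)
    scaler.scale(loss).backward()          # scale loss to avoid fp16 gradient underflow
    scaler.step(optimizer)                 # unscales grads; skips the step on inf/NaN grads
    scaler.update()
# scheduler.step(val_loss) runs once per epoch after validation
```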

From the 14th to the 17th epoch, both the training and validation losses remain NaN. Can anyone help me with this?

Thanks in advance,
Thiru

Hi @ThiruRJST,

First of all, I would check the input and the output (of the model) to see whether they are where the NaN values originate.

Okay, are you asking me to check the dataset with the current setup of augmentations?

And what should I do if it's the output?

Look at the batch that produces the NaN values and inspect both the input and the output of the model: maybe the loss is NaN because the prediction contains NaN values.
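
For example, something like this (a sketch; adapt the names to your own training loop):

```python
import torch
from torch.cuda.amp import autocast

# `model`, `criterion`, `train_loader` are the objects from your own script
for step, (images, targets) in enumerate(train_loader):
    with autocast():
        outputs = model(images)
        loss = criterion(outputs, targets)

    # Stop at the first batch that goes bad and inspect it
    if not torch.isfinite(loss):
        print(f"Bad loss at step {step}: {loss.item()}")
        print("NaN in inputs: ", torch.isnan(images).any().item())
        print("NaN in outputs:", torch.isnan(outputs).any().item())
        print("Inf in outputs:", torch.isinf(outputs).any().item())
        break
```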

@fpozzi I tried as you instructed. The inputs weren't NaN; only the outputs of the model were NaN, and that's why the loss becomes NaN. But I couldn't find which layer of the model raises this issue.
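
I was thinking of registering forward hooks to catch the first layer whose output goes non-finite, roughly like this (a sketch; `model` is my network, the hook names are mine):

```python
import torch

def make_hook(name):
    def hook(module, inputs, output):
        # Raise as soon as a layer produces a NaN/Inf activation
        if isinstance(output, torch.Tensor) and not torch.isfinite(output).all():
            raise RuntimeError(
                f"Non-finite output in layer '{name}' ({module.__class__.__name__})"
            )
    return hook

# Hook every leaf module so the first offending layer raises immediately
handles = [m.register_forward_hook(make_hook(n))
           for n, m in model.named_modules()
           if len(list(m.children())) == 0]

# outputs = model(bad_images)  # re-run the batch that triggered the NaN

for h in handles:
    h.remove()
```

Would that be a reasonable way to narrow it down, or is `torch.autograd.set_detect_anomaly(True)` the better tool here?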