NaN in Loss Function after 13 epochs

I’m training an EfficientNetV2_B3 model on my own data. I have tried several runs, but after 13 epochs the loss becomes NaN. Up to the 13th epoch, the model trains and validates fine.

Dataset: Custom
Model: EfficientNetV2_B3
Loss Function: nn.CrossEntropyLoss
Batch size: 32
Training Mode: Mixed Precision
LR Scheduler: ReduceLROnPlateau - Default settings
Optimizer: Adam
Initial Learning rate: 0.01
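
For reference, here's a simplified sketch of the training step with this setup (the model and data loader below are just stand-ins, not my actual EfficientNetV2_B3 pipeline):

```python
import torch
import torch.nn as nn
from torch.cuda.amp import autocast, GradScaler

# Stand-ins for the real EfficientNetV2_B3 and custom dataset (placeholders)
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, 10)).cuda()
train_loader = [(torch.randn(32, 3, 224, 224).cuda(),
                 torch.randint(0, 10, (32,)).cuda())]

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer)  # default settings
scaler = GradScaler()

for images, targets in train_loader:
    optimizer.zero_grad()
    with autocast():                       # mixed-precision forward pass
        outputs = model(images)
        loss = criterion(outputs, targets)
    scaler.scale(loss).backward()          # scale loss to avoid fp16 gradient underflow
    scaler.step(optimizer)                 # unscales grads; skips the step on inf/NaN grads
    scaler.update()
# scheduler.step(val_loss) runs once per epoch after validation
```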

From the 14th to the 17th epoch, both the training and validation losses remain NaN. Can anyone help me with this?

Thanks in advance,
Thiru

Hi @ThiruRJST,

First of all, I would check the input and the output (of the model) to see whether they are where the NaN values originate.

Okay, are you asking me to check the dataset with the current setup of augmentations?

And what should I do if it's the output?

Look at the batch that produces the NaN values and inspect both the input and the output of the model: maybe the loss is NaN because the prediction contains NaN values.
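
For example, something like this (a sketch; adapt the names to your own training loop):

```python
import torch
from torch.cuda.amp import autocast

# `model`, `criterion`, `train_loader` are the objects from your own script
for step, (images, targets) in enumerate(train_loader):
    with autocast():
        outputs = model(images)
        loss = criterion(outputs, targets)

    # Stop at the first batch that goes bad and inspect it
    if not torch.isfinite(loss):
        print(f"Bad loss at step {step}: {loss.item()}")
        print("NaN in inputs: ", torch.isnan(images).any().item())
        print("NaN in outputs:", torch.isnan(outputs).any().item())
        print("Inf in outputs:", torch.isinf(outputs).any().item())
        break
```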

@fpozzi I tried as you instructed. The inputs weren't NaN; only the outputs of the model were NaN, and that's why the loss becomes NaN. But I couldn't find which layer of the model raises this issue.
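
I was thinking of registering forward hooks to catch the first layer whose output goes non-finite, roughly like this (a sketch; `model` is my network, the hook names are mine):

```python
import torch

def make_hook(name):
    def hook(module, inputs, output):
        # Raise as soon as a layer produces a NaN/Inf activation
        if isinstance(output, torch.Tensor) and not torch.isfinite(output).all():
            raise RuntimeError(
                f"Non-finite output in layer '{name}' ({module.__class__.__name__})"
            )
    return hook

# Hook every leaf module so the first offending layer raises immediately
handles = [m.register_forward_hook(make_hook(n))
           for n, m in model.named_modules()
           if len(list(m.children())) == 0]

# outputs = model(bad_images)  # re-run the batch that triggered the NaN

for h in handles:
    h.remove()
```

Would that be a reasonable way to narrow it down, or is `torch.autograd.set_detect_anomaly(True)` the better tool here?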