Training loss suddenly returns to random-initialization level

I’m training an over-parameterized SuperNet, a Res-UNet model for a denoising task. Training goes well for the first 200,000 batches and the loss descends to a small value (about 80). Then, within just a few batches, it suddenly jumps back to the level it had right after random initialization (about 3000) and stays there for thousands of batches without descending. If I keep training, after tens of thousands of batches the loss sometimes starts descending again, and sometimes turns into NaN. I have tried gradient clipping and a smaller learning rate, but neither helped (a rough sketch of my training step with the clipping is below). Does anyone know what’s wrong? :dizzy_face:
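
For reference, my training step with the gradient clipping I mentioned looks roughly like this. This is only a minimal sketch: the small `nn.Sequential` stands in for my Res-UNet SuperNet, the random tensors stand in for my denoising dataset, and the optimizer, loss, and `max_norm` value are placeholders rather than my exact settings.

```python
import torch
import torch.nn as nn

# Placeholder network -- stands in for the Res-UNet SuperNet
model = nn.Sequential(
    nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 1, 3, padding=1),
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # also tried smaller lr
criterion = nn.MSELoss()

for step in range(1000):
    # Placeholder batch -- the real data comes from my denoising dataset
    noisy = torch.randn(8, 1, 64, 64)
    clean = torch.randn(8, 1, 64, 64)

    optimizer.zero_grad()
    loss = criterion(model(noisy), clean)
    loss.backward()

    # The gradient clipping I added after the loss blew up
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
```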