Unpredictable nan losses during training

I keep getting NaN losses during training in a very unpredictable way; after the first one, all the parameters in the model become NaN, forcing me to stop the training and start again.
I noticed that when the DataLoader is longer, i.e. the dataset I'm using is larger, the problem starts earlier; with a smaller dataset everything works as expected. I don't understand how the length of the DataLoader can affect the training behavior, given that nothing else has changed.
I checked the output of the DataLoader multiple times and found no problems, so my guess is that at some point during training one of the parameters in the model becomes NaN and then propagates through the model. The network has Conv, GRU, ReLU and Sigmoid layers, and I use the AdamW optimizer with weight decay. I can't share my code here because it's quite large.

What usually causes such problems, and how should I try to solve them?

My only idea was to reset the training to the latest completed epoch when a NaN is detected. It works, but it's not efficient at all, and it sometimes causes the training to get stuck in a loop.
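Something like this per-step guard might be cheaper than restarting a whole epoch: check that the loss and the gradients are finite, and skip the update otherwise. This is only a sketch with generic names (`guarded_step`, `model`, `optimizer` are placeholders, not from my actual code):

```python
import torch

def guarded_step(model, optimizer, loss):
    """Run backward/step only when the loss and all gradients are finite.

    Returns True when the optimizer step was taken, False when the batch
    was skipped. Sketch only; assumes the usual loss-per-batch loop.
    """
    if torch.isfinite(loss):
        loss.backward()
        grads_ok = all(p.grad is None or torch.isfinite(p.grad).all()
                       for p in model.parameters())
        if grads_ok:
            optimizer.step()
            optimizer.zero_grad()
            return True
    # Bad loss or bad gradients: drop them so they cannot poison the weights.
    optimizer.zero_grad()
    return False
```

This way a single bad batch is dropped instead of corrupting every parameter and forcing a restart.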

To my knowledge, a NaN loss usually means your loss has diverged to very large values. I would assume this might be caused by the GRU layers.

It could be related to the length of the DataLoader because you aren't updating the learning rate: a longer DataLoader means more optimizer steps per epoch at the same learning rate. I would assume this happens in the first few epochs of training.
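If the loss really is diverging, gradient clipping between `backward()` and `step()` is a common safeguard regardless of the root cause. A sketch, with a toy GRU standing in for your network:

```python
import torch

# Toy stand-ins for the real network and optimizer.
model = torch.nn.GRU(input_size=8, hidden_size=8, batch_first=True)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)

x = torch.randn(4, 16, 8)
out, _ = model(x)
loss = out.pow(2).mean()

loss.backward()
# Rescale all gradients so their global L2 norm is at most 1.0; this caps
# how large a step a single bad batch can cause. Returns the pre-clip norm,
# which is also handy for logging/diagnosing divergence.
total_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
optimizer.zero_grad()
```

Logging `total_norm` over time would also show you whether the gradients grow before the NaN appears.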

Having a simple code snippet of your network, your training procedure, and some context about your current task would be helpful for suggesting fixes.

Thank you for your help.

I ran some tests and figured out that the first NaN always shows up in the calculated gradients, i.e. after loss.backward(), even though the loss itself is not NaN.
I also tested the training with the built-in MSE function and it didn't have any issues, so there must be a problem with the loss function I implemented, which is in the code snippet below. I still can't figure out what the problem is, or why the gradient would sometimes be NaN.

import torch
import torch.nn.functional as F

def Loss_abs_phase(dnn_output: torch.Tensor, Target: torch.Tensor):
    c = 0.3  # magnitude compression factor
    a = 0.3  # weight of the phase-aware term

    # Clamp out-of-place instead of masked in-place assignment, so the
    # tensors passed in by the caller are not modified.
    dnn_output = torch.clamp(dnn_output, min=1e-10)
    Target = torch.clamp(Target, min=1e-10)

    ##### save phase #####
    output_phase = dnn_output.angle()
    Target_phase = Target.angle()

    ##### save magnitude #####
    output_abs = dnn_output.abs()
    Target_abs = Target.abs()

    ##### magnitude compression #####
    output_abs = output_abs.pow(c)
    Target_abs = Target_abs.pow(c)

    # MSE_abs = sum( | |Target|^c - |output|^c |^2 ) / N
    MSE_abs = F.mse_loss(output_abs, Target_abs)

    # MSE_angle = sum( | (|Target|^c)*exp(j*Target_phase) - (|output|^c)*exp(j*output_phase) |^2 ) / N
    # *2 at the end because the imaginary part is stored in a separate
    # dimension, doubling the number of elements.
    MSE_angle = F.mse_loss(Target_abs * new_exp(Target_phase),
                           output_abs * new_exp(output_phase)) * 2

    return (1 - a) * MSE_abs + a * MSE_angle

The new_exp() function is defined as follows:

def new_exp(angle):
    '''
    out[0, ...] ---> real part
    out[1, ...] ---> imaginary part
    '''
    if angle.dim() == 3:
        angle = angle.unsqueeze(0)
    return torch.cat([torch.cos(angle), torch.sin(angle)], dim=0)
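I can't be sure this is exactly what happens with your data, but one known way `abs().pow(c)` with `c < 1` produces NaN gradients is an element that is exactly zero: the backward of `pow` is infinite at 0 and the backward of `abs` is 0 there, and `inf * 0 = nan`. A minimal repro, plus the clamp-after-abs variant that keeps the gradient finite:

```python
import torch

c = 0.3  # same compression factor as in the loss above

# An exact zero makes the compressed-magnitude gradient NaN:
# d/du u**c at u=0 is inf, d|x|/dx at x=0 is 0, and inf * 0 = nan.
x = torch.tensor([0.0, 0.5], requires_grad=True)
x.abs().pow(c).sum().backward()
print(x.grad)  # first element is nan

# Clamping the magnitude *after* abs() keeps the gradient finite
# (clamped elements simply get zero gradient):
y = torch.tensor([0.0, 0.5], requires_grad=True)
y.abs().clamp(min=1e-10).pow(c).sum().backward()
print(y.grad)  # finite everywhere
```

Note also that even for clamped values around 1e-10, the `pow(c)` gradient is on the order of 1e6 per element, so near-zero magnitudes can still make the gradients explode even when nothing is exactly NaN.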

In the training loop I only have the usual steps:

  1. load data
  2. compute output
  3. compute loss
  4. loss.backward()
  5. optimizer.step()
  6. optimizer.zero_grad()

and they all worked fine with the built-in MSE loss.
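To find which backward function first produces the NaN, the steps above can be wrapped in PyTorch's anomaly detection (it slows training down a lot, so it should only be enabled while debugging). A minimal sketch with a toy tensor standing in for the real forward pass:

```python
import torch

x = torch.tensor([0.0], requires_grad=True)

caught = False
with torch.autograd.set_detect_anomaly(True):
    # Toy forward pass whose backward produces a NaN gradient.
    loss = x.abs().pow(0.3).sum()
    try:
        loss.backward()
    except RuntimeError as err:
        # Anomaly mode raises as soon as a backward function returns NaN,
        # and its message (plus the printed traceback) names the forward
        # op responsible.
        caught = True
        print("caught:", err)
```

Running the real training loop this way should point directly at the operation inside the custom loss whose backward goes NaN.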