Getting NaN after first iteration with custom loss

I have a complex model that computes a low-rank matrix and tries to minimize it while training a CNN. The training wrapper is the following:

def train_generalization(args, modelc, model1, model2, device, train_loader_combined, optimizer, epoches, criterion, batch_size):
    for epoch in range(epoches):  # loop over the dataset multiple times
        running_loss = 0.0
        for i, data in enumerate(train_loader_combined, 0):
            input_combined, labels_combined, flag = data
            labels_combined, index = torch.sort(labels_combined)
            flag = flag[index]
            input_combined, labels_combined, flag =,,
            Hc, outputs1 = modelc(input_combined)  # domain invariant representation
            Lg = criterion(outputs1, labels_combined)
            Hs1, outputs2 = model1(input_combined[flag == 1])
            Hs2, outputs3 = model2(input_combined[flag == 2])
            max_iteration = 1000
            if (epoch >= 2):
                max_iteration = 1000
            Hs = torch.cat((Hs1, Hs2), 0)
            labels_Hs = torch.cat((labels_combined[flag == 1], labels_combined[flag == 2]), 0)
            labels_Hs, index = torch.sort(labels_Hs)
            Hs = Hs[index]
            Q = get_Q(labels_combined, labels_combined, batch_size)
            Z, ZZ, E = calculate_Z(torch.transpose(Hc, 0, 1), torch.transpose(Hs, 0, 1), Q, device, batch_size)
            Lr = get_nuc_norm(Z) + get_fib_norm(Z - Q)
            Lr =
            loss = Lr

My CNN is as the following:

def __init__(self):
    super(model_gen, self).__init__()
    self.conv1 = nn.Conv2d(1, 10, 5)
    self.conv2 = nn.Conv2d(10, 20, 5)
    self.conv2_drop = nn.Dropout2d()
    self.fc1 = nn.Linear(20 * 5 * 5, 120)
    self.fc2 = nn.Linear(120, 84)
    self.fc3 = nn.Linear(84, 10)

def forward(self, x):
    x = F.max_pool2d(F.relu(self.conv1(x)), (2, 2))
    x = F.max_pool2d(F.relu(self.conv2(x)), 2)
    x = self.conv2_drop(x)
    x = x.view(-1, self.num_flat_features(x))
    x = self.fc1(x) # maps the 20*5*5 flattened features to 120
    x1 = self.fc2(x) # vector of 84
    x2 = self.fc3(x1) # vector of 10 which are the number of classes
    x4 = F.relu(x2)

For some reason, after the first iteration, the fc1 and fc2 weights become NaN, while the fc3 weights look normal.

Can anybody direct me on how to figure out the problem?

Currently your model uses the linear layers (fc1, fc2, fc3) without a non-linearity between them, so basically it’s just one single linear transformation. Is this on purpose or did you forget to add the relu or another activation function?
Might be unrelated to this issue, but might be worth a try as a first approach.
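To see why stacked linear layers without an activation collapse into one, here is a small sketch. It reuses the fc2 and fc3 sizes from the question; the `merged` layer and the random input are just illustrative assumptions, not part of the original model.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Same sizes as fc2 and fc3 in the question, with no activation in between.
fc2 = nn.Linear(120, 84)
fc3 = nn.Linear(84, 10)

# Fold both layers into one: W = W3 @ W2, b = W3 @ b2 + b3.
merged = nn.Linear(120, 10)
with torch.no_grad():
    merged.weight.copy_(fc3.weight @ fc2.weight)
    merged.bias.copy_(fc3.weight @ fc2.bias + fc3.bias)

x = torch.randn(4, 120)
stacked_out = fc3(fc2(x))
merged_out = merged(x)

# The two outputs agree up to floating point rounding.
max_diff = (stacked_out - merged_out).abs().max().item()
print(max_diff)
```

Putting a ReLU between fc2 and fc3 breaks this equivalence, which is what gives the extra layers their expressive power.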


Thank you for your reply.

You are right, I did that on purpose because I am trying to mimic a paper that describes the network this way. However, I tried adding non-linearities between them, but unfortunately that didn't fix the NaN error.

While debugging the code, I noticed that the NaNs appear in the weights of the model after I call the optimizer.

Could you check the gradients in the layers which have the NANs after the update?
You can print them with print(model.fc1.weight.grad).
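If printing each layer by hand gets tedious, a small helper can list every parameter whose gradient is non-finite. This is only a sketch: `report_bad_grads` and the toy model are hypothetical names for illustration, and the NaN is provoked on purpose.

```python
import torch
import torch.nn as nn

def report_bad_grads(model):
    """Return the names of parameters whose gradient contains NaN or Inf."""
    bad = []
    for name, param in model.named_parameters():
        if param.grad is not None and not torch.isfinite(param.grad).all():
            bad.append(name)
    return bad

# Toy model standing in for the real network.
toy = nn.Sequential(nn.Linear(3, 2), nn.Linear(2, 1))
out = toy(torch.ones(1, 3))

# Deliberately broken loss: sqrt at zero has an infinite derivative, and
# multiplying that infinity by 0.0 in the backward pass yields NaN.
loss = torch.sqrt(out * 0.0).sum()
loss.backward()

bad_names = report_bad_grads(toy)
print(bad_names)
```

Calling it right after `loss.backward()` and before `optimizer.step()` shows which layers are affected before the weights get overwritten.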

Sure, I printed the gradient after the backward() and it shows this:
tensor([[nan, nan, nan, ..., nan, nan, nan],
        [nan, nan, nan, ..., nan, nan, nan],
        [nan, nan, nan, ..., nan, nan, nan],
        [nan, nan, nan, ..., nan, nan, nan],
        [nan, nan, nan, ..., nan, nan, nan],
        [nan, nan, nan, ..., nan, nan, nan]], device='cuda:0')

OK, thanks. Then we would need to see the loss function to track down these nasty NANs.

You can see the loss function in here:
which seems pretty complicated (sorry about that).

I also have my own norm functions as follows:
def get_fib_norm(A):
    B = torch.sqrt(torch.trace(torch.mm(torch.transpose(A,0,1),A)))
    return B

def get_nuc_norm(A):
    B = torch.trace(torch.sqrt(,A)))
    return B

def get_infinity_norm(A):
    B = torch.max(torch.sum(torch.abs(A),dim=1))
    return B

def get_spec_norm(A):
    l1, B, l2 = torch.svd(A,some=False)
    C = torch.max(B)
    return C
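For comparison, here is a sketch of the same norms computed with the built-in `torch.linalg` functions (available in recent PyTorch releases). It only checks the forward values on a random matrix and is not meant as a drop-in replacement for the functions above. Keep in mind that `torch.sqrt` acts elementwise, so it does not compute a matrix square root.

```python
import torch

torch.manual_seed(0)
A = torch.randn(4, 3, dtype=torch.float64)

# Frobenius norm: sqrt(trace(A^T A)) matches the built-in "fro" norm.
fro_manual = torch.sqrt(torch.trace(A.t() @ A))
fro_builtin = torch.linalg.matrix_norm(A, ord="fro")

# Nuclear norm: the sum of the singular values.
singular_values = torch.linalg.svdvals(A)
nuc_from_svd = singular_values.sum()
nuc_builtin = torch.linalg.matrix_norm(A, ord="nuc")

# Spectral norm: the largest singular value.
spec_from_svd = singular_values.max()
spec_builtin = torch.linalg.matrix_norm(A, ord=2)

# Infinity norm: the largest absolute row sum.
inf_manual = A.abs().sum(dim=1).max()
inf_builtin = torch.linalg.matrix_norm(A, ord=float("inf"))

print(fro_builtin.item(), nuc_builtin.item(), spec_builtin.item(), inf_builtin.item())
```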

As your script is quite complicated, you could try to build PyTorch from source and try out the anomaly detection, which will try to get the method causing the NANs.
You’ll find the build instructions here.
Let me know, if you encounter any problems.
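As a rough idea of what anomaly detection reports, here is a minimal sketch using the `torch.autograd.detect_anomaly()` context manager. The tensor and the loss are made up for illustration; the real loss from the question would go in their place.

```python
import torch

x = torch.tensor([0.0, 4.0], requires_grad=True)

caught = None
try:
    with torch.autograd.detect_anomaly():
        y = torch.sqrt(x)
        # Only y[1] feeds the loss, so y[0] receives a zero upstream gradient.
        # sqrt's backward then evaluates 0 / (2 * sqrt(0)), which is NaN.
        loss = y[1]
        loss.backward()
except RuntimeError as err:
    # Anomaly mode raises as soon as a backward function returns NaN and
    # names that function in the message.
    caught = err

print(type(caught).__name__)
```

The error message, together with the forward traceback that anomaly mode prints, points at the operation that produced the NaN.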

Alternatively, you could create an executable code snippet and I could try to run it on my machine.


Thank you

I will try the anomaly detection and let you know what I find. If I can't find out what is causing the problem, maybe I will give you the executable code.

Thank you for your time and help again

I guess it is because you have zeros inside your sqrt, which causes a NaN in backprop: the derivative of sqrt(x) is 1/(2*sqrt(x)), which is infinite at x = 0.
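A quick way to reproduce this is the sketch below. It assumes a Frobenius-style norm written as the square root of the sum of squares (mathematically the same as sqrt(trace(A^T A))); `fro_norm` is a hypothetical name for illustration.

```python
import torch

def fro_norm(A):
    # Square root of the sum of squared entries, i.e. the Frobenius norm.
    return torch.sqrt((A * A).sum())

A = torch.zeros(2, 2, requires_grad=True)

# The forward value is a perfectly ordinary 0.0 ...
value = fro_norm(A)
value.backward()

# ... but the derivative of sqrt is infinite at 0, and the chain rule then
# multiplies that infinity by the zero entries of A, which is not finite.
grad_is_finite = torch.isfinite(A.grad).all().item()
print(value.item(), grad_is_finite)
```

So a loss can look fine in the forward pass and still poison every weight after a single optimizer step.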


Thanks for pointing out anomaly detection! This was very helpful in finding where nans were coming from in my custom loss function. Protip: adding a tiny epsilon where you’re dividing or taking square roots will probably do the trick.
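A minimal sketch of the epsilon trick, assuming non-negative inputs. `safe_sqrt`, `safe_div` and the value of `EPS` are hypothetical choices for illustration; pick an epsilon that suits your dtype and data scale.

```python
import torch

EPS = 1e-8  # assumed value, not a recommendation for every dtype

def safe_sqrt(t, eps=EPS):
    # Shift the argument away from zero so 1 / (2 * sqrt(t)) stays finite.
    return torch.sqrt(t + eps)

def safe_div(num, den, eps=EPS):
    # Assumes den >= 0; the epsilon keeps the denominator strictly positive.
    return num / (den + eps)

x = torch.zeros(3, requires_grad=True)
safe_sqrt(x).sum().backward()
sqrt_grad_ok = torch.isfinite(x.grad).all().item()

den = torch.zeros(3, requires_grad=True)
safe_div(torch.ones(3), den).sum().backward()
div_grad_ok = torch.isfinite(den.grad).all().item()

print(sqrt_grad_ok, div_grad_ok)
```

The gradients are finite but can still be very large near zero, so a small epsilon may need to be paired with a lower learning rate or gradient clipping.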


Here is a way of debugging the NaN problem.
First, print your model gradients, because the NaNs are likely to show up there first.
Then check the loss, and then check the inputs to your loss. Just follow the clues and you will find the bug that causes the NaNs.
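The same "follow the clues" idea can be pushed into the forward pass with forward hooks. The sketch below is one possible way to do it, not the only one: `find_nonfinite_modules` and `NegativeLog` are hypothetical names, and the NaN layer exists only to have something to find.

```python
import torch
import torch.nn as nn

def find_nonfinite_modules(model, example_input):
    """Run one forward pass; return names of submodules whose output has NaN/Inf."""
    flagged = []
    handles = []

    def make_hook(name):
        def hook(module, inputs, output):
            if isinstance(output, torch.Tensor) and not torch.isfinite(output).all():
                flagged.append(name)
        return hook

    for name, module in model.named_modules():
        if name:  # skip the root module itself
            handles.append(module.register_forward_hook(make_hook(name)))
    try:
        with torch.no_grad():
            model(example_input)
    finally:
        for handle in handles:
            handle.remove()  # always clean up the hooks
    return flagged

class NegativeLog(nn.Module):
    # Hypothetical faulty layer: the log of a negative number is NaN.
    def forward(self, x):
        return torch.log(-x.abs() - 1.0)

net = nn.Sequential(nn.Linear(2, 2), NegativeLog(), nn.Linear(2, 1))
flagged = find_nonfinite_modules(net, torch.ones(1, 2))
print(flagged)
```

The first name in the list is the earliest layer that produced a non-finite output; everything after it is usually just propagation.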

Here is some useful information about why the NaN problem could happen:
1. the learning rate


I was using torch.reciprocal_ (for element wise reciprocal) function at some point and I had to add a small epsilon to get rid of the ‘nan’ loss value. Thank you for the overall discussion.

Is there a way to have the anomaly detection on by default? I want to avoid inserting with autograd.detect_anomaly(): in different parts of the code.


You can add torch.autograd.set_detect_anomaly(True) at the beginning of the script to enable it globally.
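A small sketch of the global switch, with a made-up tensor that triggers a NaN gradient on purpose. Anomaly detection slows training down, so it is usually switched off again once the bug is found.

```python
import torch

# Turn anomaly detection on once, near the top of the script.
torch.autograd.set_detect_anomaly(True)
enabled_during = torch.is_anomaly_enabled()

x = torch.tensor([0.0, 4.0], requires_grad=True)
try:
    # x[0] gets a zero upstream gradient, so sqrt's backward computes 0 / 0.
    torch.sqrt(x)[1].backward()
    raised = False
except RuntimeError:
    raised = True

# Switch it off again for normal training runs.
torch.autograd.set_detect_anomaly(False)
enabled_after = torch.is_anomaly_enabled()

print(enabled_during, raised, enabled_after)
```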


This solved one of my problems. Thanks!

@t.ouyang Thanks. Your advice solved my NaN problem.
My problem was caused by sqrt(0).