Getting NaN after first iteration with custom loss

aam541 · September 25, 2018, 6:56pm

I have a complex model that computes a low-rank matrix and tries to minimize it while training a CNN. The training wrapper is as follows:

def train_generalization(args, modelc, model1, model2, device, train_loader_combined, optimizer, epoches, criterion, batch_size):
    for epoch in range(epoches):  # loop over the dataset multiple times
        running_loss = 0.0
        for i, data in enumerate(train_loader_combined, 0):
            input_combined, labels_combined, flag = data
            labels_combined, index = torch.sort(labels_combined)
            input_combined = input_combined[index]
            flag = flag[index]
            input_combined, labels_combined, flag = input_combined.to(device), labels_combined.to(device), flag.to(device)
            optimizer.zero_grad()
            Hc, outputs1 = modelc(input_combined)  # domain invariant representation
            Lg = criterion(outputs1, labels_combined)
            Hs1, outputs2 = model1(input_combined[flag == 1])
            Hs2, outputs3 = model2(input_combined[flag == 2])
            max_iteration = 1000
            if (epoch >= 2):
                max_iteration = 1000
            Hs = torch.cat((Hs1, Hs2), 0)
            labels_Hs = torch.cat((labels_combined[flag == 1], labels_combined[flag == 2]), 0)
            labels_Hs, index = torch.sort(labels_Hs)
            Hs = Hs[index]
            Q = get_Q(labels_combined, labels_combined, batch_size)
            Z, ZZ, E = calculate_Z(torch.transpose(Hc, 0, 1), torch.transpose(Hs, 0, 1), Q, device, batch_size)
            Lr = get_nuc_norm(Z) + get_fib_norm(Z - Q)
            Lr = Lr.to(device)
            loss = Lr
            loss.backward()
            optimizer.step()
            print(Lr)

My CNN is as follows:

def __init__(self):
    super(model_gen, self).__init__()
    self.conv1 = nn.Conv2d(1, 10, 5)
    self.conv2 = nn.Conv2d(10, 20, 5)
    self.conv2_drop = nn.Dropout2d()
    self.fc1 = nn.Linear(20 * 5 * 5, 120)
    self.fc2 = nn.Linear(120, 84)
    self.fc3 = nn.Linear(84, 10)

def forward(self, x):
    x = F.max_pool2d(F.relu(self.conv1(x)), (2, 2))
    x = F.max_pool2d(F.relu(self.conv2(x)), 2)
    x = self.conv2_drop(x)
    x = x.view(-1, self.num_flat_features(x))
    x = self.fc1(x) # matrix of 20*5*5, 120
    x1 = self.fc2(x) # vector of 84
    x2 = self.fc3(x1) # vector of 10 which are the number of classes
    x4 = F.relu(x2)

For some reason, after the first iteration, the fc1 and fc2 weights become NaN, while the fc3 weights look normal.

Can anybody direct me on how to figure out the problem?

ptrblck · September 25, 2018, 11:17pm

Currently your model uses the linear layers (fc1, fc2, fc3) without a non-linearity between them, so basically it’s just one single linear transformation. Is this on purpose or did you forget to add the relu or another activation function?
Might be unrelated to this issue, but might be worth a try as a first approach.
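
To make the point about stacked linear layers concrete, here is a small sketch that folds three activation-free layers into a single affine map and compares the outputs. The layer sizes are copied from the model in the first post; the random input and the variable names are illustrative assumptions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Same sizes as fc1, fc2, fc3 in the model above, with no activation in between.
fc1 = nn.Linear(20 * 5 * 5, 120)
fc2 = nn.Linear(120, 84)
fc3 = nn.Linear(84, 10)

x = torch.randn(4, 20 * 5 * 5)  # illustrative batch of flattened features
stacked = fc3(fc2(fc1(x)))

# Fold the three affine maps into one: y = x @ W.T + b
with torch.no_grad():
    W = fc3.weight @ fc2.weight @ fc1.weight
    b = fc3.weight @ (fc2.weight @ fc1.bias + fc2.bias) + fc3.bias
    collapsed = x @ W.T + b

# Largest absolute difference; expected to be at float32 rounding level.
max_diff = (stacked.detach() - collapsed).abs().max().item()
print(max_diff)
```

The two outputs agree up to rounding, so without a non-linearity the three layers add parameters but no extra expressive power.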

aam541 · September 25, 2018, 11:51pm

Thank you for your reply.

You are right, I did that on purpose because I am trying to mimic a paper that describes the network this way. However, I tried adding a non-linearity between them, but unfortunately that didn't fix the NaN error.

While debugging the code, I noticed that the NaNs appear in the model's weights after I call optimizer.step().

ptrblck · September 25, 2018, 11:59pm

Could you check the gradients in the layers which have the NANs after the update?
You can print them with print(model.fc1.weight.grad).
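
If the model has many layers, a loop over all parameters may be quicker than printing them one by one. This is only a sketch: `report_bad_grads` is a hypothetical helper name, and the two-layer toy model and the deliberately broken loss exist only to trigger the problem.

```python
import torch
import torch.nn as nn

def report_bad_grads(model: nn.Module) -> list:
    """Return the names of parameters whose gradient contains NaN or inf."""
    bad = []
    for name, param in model.named_parameters():
        if param.grad is not None and not torch.isfinite(param.grad).all():
            bad.append(name)
    return bad

# Toy usage: sqrt evaluated at exactly zero poisons every gradient upstream.
model = nn.Sequential(nn.Linear(3, 2), nn.Linear(2, 1))
out = model(torch.ones(1, 3))
loss = torch.sqrt(out * 0.0).sum()
loss.backward()

bad_names = report_bad_grads(model)
print(bad_names)
```

Call it right after `loss.backward()` and before `optimizer.step()`, so the weights are still intact when you inspect them.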

aam541 · September 26, 2018, 12:05am

Sure, I printed the gradient after the backward() and it shows this:
tensor([[nan, nan, nan, ..., nan, nan, nan],
        [nan, nan, nan, ..., nan, nan, nan],
        [nan, nan, nan, ..., nan, nan, nan],
        ...,
        [nan, nan, nan, ..., nan, nan, nan],
        [nan, nan, nan, ..., nan, nan, nan],
        [nan, nan, nan, ..., nan, nan, nan]], device='cuda:0')

ptrblck · September 26, 2018, 12:07am

OK, thanks. Then we would need to see the loss function to track down these nasty NANs.

aam541 · September 26, 2018, 12:13am

Sure,
You can see the loss function here: https://www.dropbox.com/s/2kh8lnkrt41rgbv/lslrrd_pytorch.py?dl=0
which seems pretty complicated (sorry about that).

I also have my own norm functions, as follows:
def get_fib_norm(A):
    B = torch.sqrt(torch.trace(torch.mm(torch.transpose(A,0,1),A)))
    return B

def get_nuc_norm(A):
    B = torch.trace(torch.sqrt(torch.mm(A,A)))
    return B

def get_infinity_norm(A):
    B = torch.max(torch.sum(torch.abs(A),dim=1))
    return B

def get_spec_norm(A):
    l1, B, l2 = torch.svd(A,some=False)
    C = torch.max(B)
    return C
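
One thing worth checking in the helpers above: `torch.sqrt` works on each entry separately, so `get_nuc_norm` takes the square root of the individual entries of `A @ A`, and any negative entry already gives NaN in the forward pass. The sketch below reproduces this on a small example and compares it with the nuclear norm computed as the sum of singular values. Whether this is the source of the NaNs in the original script is an assumption; the matrix `A` and the function names are illustrative.

```python
import torch

def get_nuc_norm_elementwise(A):
    # Same formula as get_nuc_norm above; sqrt is applied entry by entry.
    return torch.trace(torch.sqrt(torch.mm(A, A)))

def nuc_norm_svd(A):
    # Nuclear norm as the sum of singular values.
    return torch.linalg.svdvals(A).sum()

A = torch.tensor([[1.0, -2.0], [3.0, 4.0]])  # A @ A has negative entries

elementwise = get_nuc_norm_elementwise(A)
via_svd = nuc_norm_svd(A)
builtin = torch.linalg.matrix_norm(A, ord="nuc")
print(elementwise, via_svd, builtin)
```

The SVD-based versions stay finite here, although gradients through an SVD can themselves become unstable when singular values are zero or repeated, so they are not a guaranteed fix.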

ptrblck · September 26, 2018, 12:32am

As your script is quite complicated, you could try to build PyTorch from source and try out the anomaly detection, which will try to get the method causing the NANs.
You’ll find the build instructions here.
Let me know, if you encounter any problems.

Alternatively, you could create an executable code snippet and I could try to run it on my machine.
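
For reference, a minimal sketch of what anomaly detection looks like in use, assuming a PyTorch version that ships `torch.autograd.detect_anomaly`. The `toy_loss` function is a made-up stand-in for a custom loss, chosen because its gradient is undefined at zero.

```python
import torch

def toy_loss(a):
    # Frobenius norm written as sqrt(sum of squares); gradient is undefined at a == 0.
    return torch.sqrt((a * a).sum())

a = torch.zeros(2, 2, requires_grad=True)
caught = None

with torch.autograd.detect_anomaly():
    loss = toy_loss(a)
    try:
        loss.backward()
    except RuntimeError as err:
        # The message names the backward function that produced the NaN.
        caught = str(err)

print(caught)
```

Alongside the exception, PyTorch prints the forward-pass stack trace of the offending operation, which is what points you to the line in your loss. Anomaly detection slows training down, so it is best kept for debugging runs.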

aam541 · September 26, 2018, 12:47am

Thank you

I will try the anomaly detection and let you know what I find. If I couldn’t find out what is causing the problem maybe I will give you the executable code

Thank you for your time and help again

Pengbo_Ma · October 30, 2018, 12:15am

I guess it is because the argument of your sqrt is zero, which causes a NaN in backprop.
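
A short sketch of the mechanism, using made-up zero tensors: the derivative of sqrt at zero is infinite, and the chain rule then multiplies that infinity by a zero, which gives NaN.

```python
import torch

# d/dx sqrt(x) = 1 / (2 * sqrt(x)), which is inf at x = 0.
x = torch.zeros(1, requires_grad=True)
torch.sqrt(x).sum().backward()
grad_sqrt = x.grad.clone()

# Chain rule through a * a multiplies that inf by a = 0, and inf * 0 = nan.
a = torch.zeros(3, requires_grad=True)
torch.sqrt((a * a).sum()).backward()
grad_norm = a.grad.clone()

print(grad_sqrt)
print(grad_norm)
```

So the forward value (zero) looks harmless, and the NaN only shows up in the gradients after `backward()`.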

jbohnslav · April 4, 2019, 11:21pm

Thanks for pointing out anomaly detection! This was very helpful in finding where nans were coming from in my custom loss function. Protip: adding a tiny epsilon where you’re dividing or taking square roots will probably do the trick.
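
A sketch of the epsilon trick, with hypothetical helper names `safe_sqrt` and `safe_div`. The value of `EPS` is an assumption and should be chosen relative to the scale of your data; `safe_div` as written also assumes a non-negative denominator.

```python
import torch

EPS = 1e-8  # assumed value; tune to the scale of your data

def safe_sqrt(x, eps=EPS):
    # Keeps the argument strictly positive so the derivative stays finite.
    return torch.sqrt(x + eps)

def safe_div(num, den, eps=EPS):
    # Assumes den >= 0; a signed denominator needs a different guard.
    return num / (den + eps)

a = torch.zeros(3, requires_grad=True)
norm = safe_sqrt((a * a).sum())
norm.backward()

quotient = safe_div(torch.tensor(1.0), torch.tensor(0.0))
print(a.grad, quotient)
```

The price is a small bias in the result: the norm of a zero vector is now `sqrt(EPS)` rather than exactly zero.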

t.ouyang · December 17, 2019, 1:51pm

Here is a way of debugging the NaN problem.
First, print your model's gradients, because the NaNs are likely to show up there first.
Then check the loss, and then check the inputs to your loss. Just follow the clues and you will find the bug that produces the NaNs.

Here is some useful information about why the NaN problem could happen:
1. the learning rate
2. sqrt(0)
3. ReLU -> LeakyReLU
aribryan · December 19, 2019, 12:15pm

I was using torch.reciprocal_ (for element wise reciprocal) function at some point and I had to add a small epsilon to get rid of the ‘nan’ loss value. Thank you for the overall discussion.

azeeshan · April 19, 2020, 8:18am

Is there a way to have the anomaly detection on by default? I want to avoid inserting with autograd.detect_anomaly(): in different parts of the code.

ptrblck · April 19, 2020, 8:57am

You can add torch.autograd.set_detect_anomaly(True) at the beginning of the script to enable it globally.

Saurabh_Kataria · May 3, 2020, 9:18pm

This solved one of my problems. Thanks!

changwn · April 20, 2021, 1:12pm

@t.ouyang Thanks. Your advice solved my NaN problem.
My problem was caused by sqrt(0).
#------------------------------------
Here is some useful information about why the NaN problem could happen:
1. the learning rate
2. sqrt(0)
3. ReLU -> LeakyReLU

Sentient07 · February 28, 2022, 1:58pm

Hello.

I face exactly the same issue. One similarity with the OP's post is that we both compute a norm. As @Pengbo_Ma mentioned, the problem seems to vanish when I add a small epsilon to the input of linalg.matrix_norm(). Ideally, this should be supported as a parameter of the function and/or PyTorch should handle it internally. @ptrblck, can you please comment on what the best practice is?

ptrblck · February 28, 2022, 7:58pm

I think adding a small eps value to the input matrix is the right approach as it’s explicitly guarding against invalid gradients as seen e.g. here:

x = torch.zeros(4, 4, requires_grad=True)
out = torch.linalg.matrix_norm(x)
out.backward()
print(x.grad)
# > tensor([[nan, nan, nan, nan],
#          [nan, nan, nan, nan],
#          [nan, nan, nan, nan],
#          [nan, nan, nan, nan]])

You could add e.g. 1e-6 to the input to avoid the NaN values in the gradient.
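
As a sketch of that suggestion, here is the same example with 1e-6 added to the input before the norm is taken. The offset value is the one mentioned above; whether it suits your data scale is an assumption to check.

```python
import torch

x = torch.zeros(4, 4, requires_grad=True)

# Shift the input slightly so the norm is never evaluated at exactly zero.
out = torch.linalg.matrix_norm(x + 1e-6)
out.backward()

print(out)
print(x.grad)
```

The gradient is now finite everywhere, at the cost of a norm that is slightly off from the exact value.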

duyducvo4444 · March 7, 2022, 8:41pm

Dear @ptrblck ,
Is there any way to get the tensor's value after set_detect_anomaly raises the error?
Specifically, it gives me this trace:

  File "/home/s1910442/Project/Master3/models/ComplexVQ2_WITHOUT_PRETRAINING.py", line 111, in si_snr
    snr = 10 * torch.log10(target_norm / (noise_norm + eps) + eps)
 (function _print_stack)
Traceback (most recent call last):
  File "train_no_pretrain.py", line 128, in <module>
    train(**train_config)
  File "train_no_pretrain.py", line 57, in train
    loss.backward()
  File "/opt/conda/lib/python3.8/site-packages/torch/_tensor.py", line 256, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/opt/conda/lib/python3.8/site-packages/torch/autograd/__init__.py", line 147, in backward
    Variable._execution_engine.run_backward(
RuntimeError: Function 'Log10Backward' returned nan values in its 0th output.

What I want to ask is: how can I get the values of target_norm and noise_norm at that point? eps here is just 1e-9.
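
One possible approach, sketched under several assumptions: register a tensor hook on the argument of `log10` that saves the forward value whenever the gradient arriving at it is not finite. `watch` is a hypothetical helper, the tensors below are made-up stand-ins for `target_norm` and `noise_norm`, and the outer `+ eps` is left off on purpose so that the toy example fails. The hook is probably best used in a run without anomaly detection, since the anomaly check may abort backward before the hook fires (worth verifying on your version).

```python
import torch

def watch(name, tensor, log):
    """Record the forward value of `tensor` if its incoming gradient is non-finite."""
    value = tensor.detach().clone()

    def hook(grad):
        if not torch.isfinite(grad).all():
            log.append((name, value, grad.detach().clone()))

    if tensor.requires_grad:
        tensor.register_hook(hook)
    return tensor

eps = 1e-9
log = []
target_norm = torch.tensor([0.0, 2.0], requires_grad=True)  # made-up values
noise_norm = torch.tensor([1.0, 1.0])                       # made-up values

# Outer eps deliberately omitted so that log10 sees an exact zero.
ratio = watch("ratio", target_norm / (noise_norm + eps), log)
snr = 10 * torch.log10(ratio)
snr.sum().backward()

for name, value, grad in log:
    print(name, value, grad)
```

The same wrapper can be applied to `target_norm` and `noise_norm` themselves; the entries in `log` then show which forward values were present when the gradient went bad.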