I’m training a model to predict landmarks on faces. My loss function looks like the following:
"
import torch  # model_ft, inputs, target, and potenz are defined elsewhere

logits = model_ft(inputs)            # raw network outputs
out = torch.sigmoid(logits)          # squash predictions to [0, 1]
loss_temp = torch.abs(out - target) ** potenz
loss_temp[torch.isnan(target)] = 0   # zero the loss where the landmark is missing
loss = torch.mean(loss_temp)
loss.backward()
"
Not all landmarks are provided for every sample, which is why I set the loss to zero for the missing landmarks. With potenz=1 everything works fine, but if I change it to potenz=2, or any other value such as potenz=1.0001, the gradients of the model's weights become NaN after the first backward pass.
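Here is a stripped-down toy example that, as far as I can tell, reproduces the same behavior (made-up tensors instead of my real model and data):

"
import torch

# One valid entry and one NaN standing in for a missing landmark
diff = torch.tensor([0.5, float('nan')], requires_grad=True)

loss_temp = torch.abs(diff) ** 2     # potenz = 2; the NaN entry stays NaN
loss_temp[torch.isnan(diff)] = 0     # zero it out afterwards, as in my code
loss_temp.mean().backward()

print(diff.grad)                     # prints tensor([0.5000, nan]) for me
"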
I have tried very small learning rates, as well as exponents like 0.0001, 0.9999, and 1.0001: the gradients only become NaN when I choose an exponent other than 0 or 1. Since I inspected the gradients right after the first backward pass, I don’t think the loss is exploding.
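For reference, this is roughly how I check the gradients after the first backward pass (model_ft is the network from the snippet above):

"
# Largest absolute gradient per parameter, right after loss.backward()
for name, param in model_ft.named_parameters():
    if param.grad is not None:
        print(name, param.grad.abs().max().item())
"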