RuntimeError: Function 'PowBackward1' returned nan values in its 1st output

I defined a loss function for my network, and it’s computed as:

    loss = pred.mm(pred.T) - (label.view(-1, 1) == label.view(1, -1)).to(dtype=torch.float32)
    loss = loss**2
    loss = torch.pow(loss, (2 - loss)**2)
    loss = loss.sum() / (label.size(0)**2 + 1.0)
    return loss

During training, the following error appeared:

    RuntimeError: Function 'PowBackward1' returned nan values in its 1st output.

and it is traced back to this line:

    loss = torch.pow(loss, (2 - loss)**2)

I'm not clear on how this error relates to this line of code. Can anyone help me out?

Hi,

What is the value of the loss when this happens?
Most likely you reach a value of the loss for which loss.pow((2-loss)**2) is not differentiable.
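
For example: since loss is squared on the previous line, it is non-negative, and if it lands exactly on 0 the gradient with respect to the exponent involves log(base), which is -inf at 0 and turns the gradient into nan. A minimal sketch of that failure mode (the zero base is made up for illustration):

    import torch

    # Anomaly mode is what turns a silent nan into the
    # "Function 'PowBackward1' returned nan values" RuntimeError.
    torch.autograd.set_detect_anomaly(True)

    base = torch.zeros(1, requires_grad=True)  # stands in for loss**2 hitting 0
    exponent = (2 - base) ** 2                 # tensor exponent -> PowBackward1
    out = torch.pow(base, exponent)

    # d(base**exponent)/d(exponent) = base**exponent * log(base)
    #                               = 0 * (-inf) = nan
    out.backward()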

Hi, I rewrote the loss and now it’s computed as:

    loss = pred.mm(pred.T) - (label.view(-1, 1) == label.view(1, -1)).to(dtype=torch.float32)
    loss = (loss**2).sum()
    loss = loss / (label.size(0)**2 + 1.0)

    return loss

where pred is computed as:

    pred = 1.0 / (1.0 + (-pred).exp())

however, another error occurred:

    RuntimeError: Function 'ExpBackward' returned nan values in its 0th output.

which is traced to the .exp() operation.
The loss value is 0.980656 when this happens. Can you help me figure this out?

The error disappeared after I reduced the initial learning rate by half, but I'm still not clear about the mechanism behind it.

In that case, what most likely happens is that the value of pred (before the sigmoid) grows so large in magnitude that (-pred).exp() overflows to inf, and that leads to nan values in the gradients. Halving the learning rate keeps the updates, and therefore pred, from blowing up.
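
A minimal sketch of that overflow, assuming pred reaches a large negative value (the -100.0 is made up for illustration; in float32, exp(100) is already inf):

    import torch

    torch.autograd.set_detect_anomaly(True)

    pred = torch.tensor([-100.0], requires_grad=True)  # hypothetical blown-up logit
    out = 1.0 / (1.0 + (-pred).exp())  # exp(100) overflows to inf, out becomes 0
    out.backward()                     # backward of exp multiplies 0 * inf = nan

Note that 1.0 / (1.0 + (-pred).exp()) is just the sigmoid; torch.sigmoid(pred) computes the same thing in a numerically stable way and avoids this overflow entirely.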

Hello @Hawk,
You mentioned that the error “is traced to this line”. Can you share how you traced it back?
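
For context: error messages of the form "Function 'PowBackward1' returned nan values in its 1st output" come from autograd's anomaly detection, so the trace was presumably obtained with something like the sketch below (compute_loss is a hypothetical stand-in for the posted loss code):

    import torch

    # With anomaly detection on, the forward pass records a traceback for
    # each operation; if a backward node then produces nan, autograd raises
    # a RuntimeError pointing at the forward line that created it.
    with torch.autograd.detect_anomaly():
        loss = compute_loss(pred, label)  # hypothetical stand-in
        loss.backward()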