How can the model update its weights when the loss is ‘nan’ in some batches?

Hi,
I have a custom loss function which contains log(y).
y is one of the model outputs, and I don’t want to restrict y to be positive through any activation function.

So it is possible that the loss returns ‘nan’ because of log(y) whenever y < 0.
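
For illustration, a minimal sketch of the kind of loss I mean (the function name and the second term are just placeholders, not my actual loss):

import torch
import torch.nn.functional as F

def custom_loss(y, target):
    # placeholder example: the log term turns the whole loss into nan
    # as soon as any y is negative
    return -torch.log(y).mean() + F.mse_loss(y, target)

y = torch.tensor([0.5, -0.2, 1.3])   # one negative model output
print(custom_loss(y, torch.ones(3)))
> tensor(nan)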

Somehow, my model returns a ‘nan’ loss in some batches, but it keeps on training until it converges.

I wonder how the model updates its weights when the loss returns ‘nan’…?

Thank you

I don’t know how the model is updated exactly, but you could check all .grad attributes after the loss.backward() call and see if there are any nan values.
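For example, something along these lines (just a quick check, assuming model is your nn.Module):

# after loss.backward(), look for parameters with nan gradients
for name, param in model.named_parameters():
    if param.grad is not None and torch.isnan(param.grad).any():
        print(f'nan gradient in {name}')
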
Generally, a nan loss could break your model as seen here:

import torch

torch.manual_seed(2809)

# optimize a single parameter tensor directly
x = torch.randn(10, requires_grad=True)
optimizer = torch.optim.SGD([x], lr=1.)

# log() of the negative entries is nan, which propagates into the loss
y = torch.log(x)
y = y * y
loss = y.mean()
print(loss)
> tensor(nan, grad_fn=<MeanBackward0>)

# the nan entries also show up in the gradient
loss.backward()
print(x.grad)
> tensor([    nan, -0.0501,     nan, -0.0420,     nan,     nan,     nan,     nan,
        -0.2245,     nan])

# the step writes the nan gradients into the parameter itself
optimizer.step()
print(x)
> tensor([   nan, 0.8653,    nan, 0.8805,    nan,    nan,    nan,    nan, 0.7679,
           nan], requires_grad=True)

Are you sure the optimizer.step() is indeed performed, or does your code have specific guards to skip it?
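
Such a guard could look roughly like this (just a sketch, not taken from your code):

import torch

model = torch.nn.Linear(5, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.001)

data = torch.randn(8, 5)
loss = torch.log(model(data)).mean()   # nan if any output is negative

# skip the whole update if the loss is not finite
if torch.isfinite(loss):
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
else:
    print('non-finite loss, skipping this batch')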

Thank you so much for your reply!
Your sample code inspired the following toy model:

import torch
torch.manual_seed(2810)
# input
x = torch.randn(5, requires_grad=True)

# model
class simple(torch.nn.Module):
    def __init__(self):
        super(simple, self).__init__()
        self.layer = torch.nn.Linear(5, 1, bias=False)
    def forward(self, in_):
        return self.layer(in_)
model = simple()
optimizer = torch.optim.SGD(model.parameters(), lr=0.001)
model.zero_grad()

# the linear output happens to be negative here, so log() returns nan
y = torch.log(model(x))
loss = y

print('input x: ', x)
> input x: tensor([-0.6195,  1.2090,  0.7255,  0.3041, -1.8422], requires_grad=True)
print('layer weights: ', model.layer.weight.data)
> layer weights: tensor([[ 0.0566,  0.3471,  0.0256, -0.3857,  0.2482]])
print('loss: ', loss)
> loss: tensor([nan], grad_fn=<LogBackward>)

Then I ran the backward pass and checked the gradients:

loss.backward()
print('layer weights: ', model.layer.weight.data)
> layer weights: tensor([[ 0.0566,  0.3471,  0.0256, -0.3857,  0.2482]])
print('layer weights grad: ', model.layer.weight.grad)
> layer weights grad: tensor([[ 3.6150, -7.0544, -4.2334, -1.7744, 10.7490]])

optimizer.step()
print('layer weights: ', model.layer.weight.data)
> layer weights: tensor([[ 0.0530,  0.3542,  0.0298, -0.3840,  0.2374]])

The model obtains gradients for the layer weights and updates them even though the loss is ‘nan’!
I suspect the optimizer has some other strategy for updating the model’s weights when it receives a ‘nan’ loss,
but I’m still looking for the answer…
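
One thing I noticed while playing with this, which might be part of the answer: the gradient of torch.log at a negative input is still finite (it is 1/y), so only the forward value becomes nan, not the backward pass:

import torch

y = torch.tensor([-0.5], requires_grad=True)
out = torch.log(y)    # forward value is nan
out.backward()
print(out)
> tensor([nan], grad_fn=<LogBackward>)
print(y.grad)
> tensor([-2.])

If that is right, the weight gradient in my toy model is just x / (w·x), which stays finite even for a negative w·x, and indeed that matches the printed gradients above. So the ‘nan’ in the loss never reaches the weights here.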

Thank you so much again for your kind reply, you really saved me a lot of time!