Weight becomes nan after step but grad is normal

I have a training script where weights are consistently turning into nan, but I don't see how it's happening. I set up the following block to debug.

temp = convLayer.weight.detach().clone()  # snapshot of the weights before the step
optimizer.step()
if convLayer.weight.isnan().any():
    import pdb; pdb.set_trace()  # break as soon as any weight is nan

When the breakpoint triggers, I can print:

(Pdb)  convLayer.weight.grad[convLayer.weight.isnan()]
tensor([-3.3897e-10], device='cuda:0')
(Pdb)  convLayer.weight[convLayer.weight.isnan()]
tensor([nan], device='cuda:0', grad_fn=<IndexBackward0>)
(Pdb) temp[convLayer.weight.isnan()]
tensor([-0.0122], device='cuda:0')

It looks like the weight at this index is not nan before optimizer.step() and is nan after it, but the grad in between is a normal-looking value. The following is the state of the optimizer.

(Pdb) optimizer
RMSprop (
Parameter Group 0
    alpha: 0.95
    capturable: False
    centered: True
    differentiable: False
    eps: 0.01
    foreach: None
    lr: 0.00025
    maximize: False
    momentum: 0
    weight_decay: 0
)

Bumping since there are no replies yet. Anyone have any thoughts?

Can you provide a replicable example? I’m not getting NaNs in my toy conv layer example.

I updated my code to be the following:

    torch.save(optimizer, 'opt.pt')
    torch.save(optimizer.param_groups[0]['params'][8].grad, 'grad.pt')
    optimizer.step()
    if(convLayer.weight.isnan().any()):
        import pdb; pdb.set_trace()

I’ve uploaded the optimizer and grad saves here and then you can run the following

import torch
temp = torch.load('opt.pt', weights_only=False)
grad = torch.load('grad.pt', weights_only=False)
temp.param_groups[0]['params'][8].grad = grad
print(temp.param_groups[0]['params'][8][62][83])
print(grad[62][83])
temp.step()
print(temp.param_groups[0]['params'][8][62][83])

to output

tensor(0.0085, device='cuda:0', grad_fn=<SelectBackward0>)
tensor(-6.5587e-22, device='cuda:0')
tensor(nan, device='cuda:0', grad_fn=<SelectBackward0>)

It seems that if you assign grad (the variable named grad, not the gradient) to temp's .grad rather than grad's .grad, the gradients get messed up when temp.step() is called.

so instead of:

temp = torch.load(path_to_pt_files + 'opt.pt', weights_only=False)
grad = torch.load(path_to_pt_files + 'grad.pt', weights_only=False)
print(temp.param_groups[0]['params'][8][62][83])
print(grad[62][83])
print('...')
temp.param_groups[0]['params'][8].grad = grad
print(temp.param_groups[0]['params'][8][62][83])
print(grad[62][83])
print('...')
temp.step()
print(temp.param_groups[0]['params'][8][62][83])
print(grad[62][83])

which outputs:

tensor(0.0085, device='cuda:0', grad_fn=<SelectBackward0>)
tensor(-6.5587e-22, device='cuda:0')
...
tensor(0.0085, device='cuda:0', grad_fn=<SelectBackward0>)
tensor(-6.5587e-22, device='cuda:0')
...
tensor(nan, device='cuda:0', grad_fn=<SelectBackward0>)
tensor(-6.5587e-22, device='cuda:0')

we need to assign .grad explicitly:

...
temp.param_groups[0]['params'][8].grad = grad[62][83].grad
...

which outputs:

tensor(0.0085, device='cuda:0', grad_fn=<SelectBackward0>)
tensor(-6.5587e-22, device='cuda:0')
...
tensor(0.0085, device='cuda:0', grad_fn=<SelectBackward0>)
tensor(-6.5587e-22, device='cuda:0')
...
tensor(0.0085, device='cuda:0', grad_fn=<SelectBackward0>)
tensor(-6.5587e-22, device='cuda:0')

Thank you for the continued correspondence. But grad is the original grad tensor of temp.param_groups[0]['params'][8]; it will not have a gradient itself.

>>> grad[62][83].grad is None
True
>>> grad.grad is None
True

The question is: temp.param_groups[0]['params'][8] is all normal numbers and the grad tensor is all normal numbers, so why does the optimizer step push a normal parameter value to nan with a normal grad? Only one other number in the tensor has the same problem:

>>> temp.param_groups[0]['params'][8].isnan().sum()
tensor(2, device='cuda:0')

My initial reply was wrong.

New reply:

When I load your optimizer ('opt.pt') it has .grad as None:

grad = torch.load(path_to_pt_files + 'grad.pt', weights_only=False)
opt = torch.load(path_to_pt_files + 'opt.pt', weights_only=False)

print('grad:')
print(grad)
print(grad.grad)

print('opt:')
print(opt.param_groups[0]['params'][8].grad)

which prints:

grad:
tensor([[ 0.0000e+00,  0.0000e+00,  0.0000e+00,  ...,  0.0000e+00,
          0.0000e+00,  0.0000e+00],
        [ 2.6891e-04,  2.6891e-04,  2.6891e-04,  ...,  1.2958e-04,
          9.1939e-05,  1.3036e-04],
        [-3.6381e-06, -3.6381e-06, -3.6381e-06,  ..., -1.3383e-06,
         -1.3358e-06, -1.3876e-06],
        ...,
        [ 5.5009e-04,  5.5009e-04,  5.5009e-04,  ...,  2.4709e-04,
          2.2063e-04,  1.9834e-04],
        [-7.2875e-04, -7.2875e-04, -7.2875e-04,  ..., -3.2083e-04,
         -2.5821e-04, -3.2125e-04],
        [-3.8400e-07, -3.8400e-07, -3.8400e-07,  ..., -1.7627e-07,
         -1.3704e-07, -1.5496e-07]], device='cuda:0')
None
opt:
None

That could be because you're saving the optimizer object itself rather than its state_dict(). If you then try to assign new gradients to the optimizer loaded from your existing 'opt.pt' file, as in the following example:

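import torch
from torch import nn

path_to_pt_files = './'  # assumed location of the saved files; adjust as needed

model = nn.Sequential(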
          nn.Linear(6, 2, bias=False),
          nn.Sigmoid(),
        )
inputs = torch.randn(6)
target = torch.randn(2)
criterion = nn.MSELoss()

optimizer = torch.optim.RMSprop(model.parameters(), lr=0.01)
torch.save(optimizer.state_dict(), path_to_pt_files + 'example_optim.pt')
optimizer_state_dict = torch.load(path_to_pt_files + 'example_optim.pt', weights_only=False)
optimizer = torch.optim.RMSprop(model.parameters(), lr=0.01)
optimizer.load_state_dict(optimizer_state_dict)
print('saved optimizer.grad: ', optimizer.param_groups[0]['params'])

optimizer.zero_grad()
torch.save(optimizer.state_dict(), path_to_pt_files + 'example_optim.pt')
optimizer_state_dict = torch.load(path_to_pt_files + 'example_optim.pt', weights_only=False)
optimizer = torch.optim.RMSprop(model.parameters(), lr=0.01)
optimizer.load_state_dict(optimizer_state_dict)
print('saved after zeroing optimizer.grad: ', optimizer.param_groups[0]['params'])

optimizer.zero_grad()
output = model(inputs)
loss = criterion(output, target)
loss.backward()

torch.save(optimizer.state_dict(), path_to_pt_files + 'example_optim.pt')
optimizer_state_dict = torch.load(path_to_pt_files + 'example_optim.pt', weights_only=False)
optimizer = torch.optim.RMSprop(model.parameters(), lr=0.01)
optimizer.load_state_dict(optimizer_state_dict)
print('saved after predicting optimizer.grad: ', optimizer.param_groups[0]['params'])
optimizer.step()
torch.save(optimizer.state_dict(), path_to_pt_files + 'example_optim.pt')
optimizer_state_dict = torch.load(path_to_pt_files + 'example_optim.pt', weights_only=False)
optimizer = torch.optim.RMSprop(model.parameters(), lr=0.01)
optimizer.load_state_dict(optimizer_state_dict)
print('saved after stepping optimizer.grad: ', optimizer.param_groups[0]['params'])

new_gradient = torch.rand(optimizer.param_groups[0]['params'][0].grad.shape, requires_grad=True)
print(f'new_gradient: {new_gradient}')
optimizer.param_groups[0]['params'][0] = torch.nn.parameter.Parameter(new_gradient, requires_grad=True)
print('new values for optimizer.grad: ', optimizer.param_groups[0]['params'])

torch.save(optimizer.state_dict(), path_to_pt_files + 'example_optim.pt')
optimizer_state_dict = torch.load(path_to_pt_files + 'example_optim.pt', weights_only=False)
optimizer = torch.optim.RMSprop(model.parameters(), lr=0.01)
optimizer.load_state_dict(optimizer_state_dict)
print('new values after saving again for optimizer.grad: ', optimizer.param_groups[0]['params'])

it will cause a KeyError when saving out the optimizer again:

saved optimizer.grad:  [Parameter containing:
tensor([[ 3.9279e-01,  2.6180e-01,  3.1024e-01, -3.6962e-02, -1.4670e-01,
         -2.1468e-01],
        [-1.2634e-04, -2.9840e-01,  9.1751e-02, -1.2751e-01,  1.7776e-01,
         -2.1791e-01]], requires_grad=True)]
saved after zeroing optimizer.grad:  [Parameter containing:
tensor([[ 3.9279e-01,  2.6180e-01,  3.1024e-01, -3.6962e-02, -1.4670e-01,
         -2.1468e-01],
        [-1.2634e-04, -2.9840e-01,  9.1751e-02, -1.2751e-01,  1.7776e-01,
         -2.1791e-01]], requires_grad=True)]
saved after predicting optimizer.grad:  [Parameter containing:
tensor([[ 3.9279e-01,  2.6180e-01,  3.1024e-01, -3.6962e-02, -1.4670e-01,
         -2.1468e-01],
        [-1.2634e-04, -2.9840e-01,  9.1751e-02, -1.2751e-01,  1.7776e-01,
         -2.1791e-01]], requires_grad=True)]
saved after stepping optimizer.grad:  [Parameter containing:
tensor([[ 0.2928,  0.1618,  0.2102,  0.0630, -0.2467, -0.3147],
        [-0.1001, -0.3984, -0.0082, -0.0275,  0.0778, -0.3179]],
       requires_grad=True)]
new_gradient: tensor([[0.8659, 0.3938, 0.4791, 0.8014, 0.1843, 0.5315],
        [0.7555, 0.2915, 0.1225, 0.6326, 0.5627, 0.9159]], requires_grad=True)
new values for optimizer.grad:  [Parameter containing:
tensor([[0.8659, 0.3938, 0.4791, 0.8014, 0.1843, 0.5315],
        [0.7555, 0.2915, 0.1225, 0.6326, 0.5627, 0.9159]], requires_grad=True)]
Traceback (most recent call last):
...
...
...
(param_mappings[id(k)] if isinstance(k, torch.Tensor) else k): v
KeyError: 139737484299376

I'm not sure why that happens at the moment, so I'll have to come back to this. Or someone smarter than me could give us some direction.

You are correct that torch.save does not save the gradient. This is why I save the grad separately and then assign it after loading. I know this isn't particularly normal, but it is the easiest way to produce code that can be replicated with the required values in the tensor, gradient, and optimizer. The goal of the question is not to fix the problem; it is to identify why this optimizer, this tensor, and this gradient result in nan after the step, so that I have a starting point for debugging the real program that causes the problem.
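
One way to get that starting point might be to look at the optimizer's internal running averages for the affected parameter before stepping. The sketch below is only an illustration: it assumes the state keys "square_avg" and "grad_avg" that torch.optim.RMSprop keeps when centered=True, uses a small random stand-in parameter instead of the real conv weight, and next_centered_variance is a hypothetical helper that re-derives the quantity under the square root by hand, so it may differ from the optimizer's own arithmetic in the last bits.

```python
import torch


def next_centered_variance(optimizer, param, alpha):
    """Estimate v - m^2 as the next centered RMSprop step would compute it."""
    state = optimizer.state[param]
    grad = param.grad
    square_avg = alpha * state["square_avg"] + (1 - alpha) * grad * grad
    grad_avg = alpha * state["grad_avg"] + (1 - alpha) * grad
    return square_avg - grad_avg * grad_avg


# Stand-in parameter; in the thread this would be param_groups[0]['params'][8].
torch.manual_seed(0)
param = torch.nn.Parameter(torch.randn(4, 3))
optimizer = torch.optim.RMSprop([param], lr=0.00025, alpha=0.95,
                                eps=0.01, centered=True)

# The state is created lazily, so take one step before inspecting it.
param.grad = torch.randn(4, 3)
optimizer.step()

param.grad = torch.randn(4, 3)  # gradient for the step under investigation
centered_var = next_centered_variance(optimizer, param, alpha=0.95)
print("entries with negative v - m^2:", int((centered_var < 0).sum()))
```

Run against the loaded optimizer and the reassigned grad, any index where this value is negative would be a candidate for turning into nan on the next step.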

Just commenting to bump this up, still seeking a solution.

To get to the bottom of this, we need to consider how RMSprop is calculated, given the parameters you've changed from the default:

You've changed lr, alpha, and eps, and set centered = True. With centered = True, the denominator of the update contains the following term:

√(v - m²) + ε

Where v is the running average of the squared gradients (the second moment), m is the running average of the gradients (the first moment), and ε is just eps. When v - m² < 0, we get nans because we're taking the square root of a negative number (it might still work if you used complex numbers).
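
For reference, here is a sketch of the full centered update as I understand it from the PyTorch RMSprop documentation, written with the symbols above. The extra names are my own notation: g_t is the current gradient, θ is the weight, and η is lr. Momentum and weight decay are left out because both are 0 in the posted configuration.

```latex
\begin{aligned}
v_t      &= \alpha\, v_{t-1} + (1-\alpha)\, g_t^2 \\
m_t      &= \alpha\, m_{t-1} + (1-\alpha)\, g_t \\
\theta_t &= \theta_{t-1} - \eta\, \frac{g_t}{\sqrt{v_t - m_t^2} + \varepsilon}
\end{aligned}
```

If v_t - m_t^2 is even slightly negative for one element, the square root is nan for that element and so is the updated weight, no matter how small or ordinary g_t looks.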

This is occurring because of how you've set α. α is the smoothing constant of the running averages: the lower α becomes, the more weight the newest gradient gets, including in the first moment (m above). So let's say we have a large new gradient but a small second moment. A lower α then increases the likelihood of a negative number under the square root in the denominator.
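
To make the mechanism concrete, here is a minimal scalar sketch of one centered step in plain Python. The weight and gradient are the ones printed earlier in the thread, but the prior state values (square_avg, grad_avg) are hypothetical and chosen by hand so that v - m² ends up negative; they are not the values stored in opt.pt. In a real run a negative value would more plausibly come from floating-point rounding when v and m² are nearly equal. centered_rmsprop_step is an illustrative helper, not PyTorch's implementation.

```python
import math


def centered_rmsprop_step(param, grad, square_avg, grad_avg,
                          lr=0.00025, alpha=0.95, eps=0.01):
    """One centered-RMSprop update for a single scalar weight.

    Returns the updated (param, square_avg, grad_avg).
    """
    square_avg = alpha * square_avg + (1 - alpha) * grad * grad  # v
    grad_avg = alpha * grad_avg + (1 - alpha) * grad             # m
    centered_var = square_avg - grad_avg * grad_avg              # v - m^2
    # math.sqrt raises on negative input, whereas tensor sqrt yields nan,
    # so the nan is produced by hand here to mimic the tensor behaviour.
    root = math.sqrt(centered_var) if centered_var >= 0 else float("nan")
    param = param - lr * grad / (root + eps)
    return param, square_avg, grad_avg


# Weight and gradient from the thread; the state values are hypothetical.
weight, grad = 0.0085, -6.5587e-22
new_weight, v, m = centered_rmsprop_step(weight, grad,
                                         square_avg=1.0e-6, grad_avg=1.1e-3)
print(new_weight, v - m * m)
```

The gradient itself is tiny and finite, yet the weight still becomes nan, because the nan enters through the denominator rather than through the gradient.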

Solution:
I'd try making α closer to 1 (e.g. alpha = 0.999) or setting centered = False. See if that resolves the issue.
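
As a rough illustration of why centered = False avoids the problem, the sketch below compares the two denominators for the same state in plain Python. The state values are hypothetical, picked so that m² slightly exceeds v, and rmsprop_denominator is an illustrative helper rather than anything from PyTorch.

```python
import math


def rmsprop_denominator(square_avg, grad_avg, eps=0.01, centered=True):
    """Denominator of a scalar RMSprop update, with or without centering."""
    var = square_avg - grad_avg * grad_avg if centered else square_avg
    # Produce nan by hand for a negative input, mimicking tensor sqrt.
    root = math.sqrt(var) if var >= 0 else float("nan")
    return root + eps


# Hypothetical state in which m^2 slightly exceeds v.
v, m = 1.0e-6, 1.001e-3
centered_denom = rmsprop_denominator(v, m, centered=True)
plain_denom = rmsprop_denominator(v, m, centered=False)
print(centered_denom, plain_denom)
```

Without centering, the square root is taken of v alone, which is an average of squares and so cannot be negative. In PyTorch that corresponds to constructing the optimizer with centered=False, for example torch.optim.RMSprop(params, lr=0.00025, alpha=0.95, eps=0.01, centered=False).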