Weight becomes nan after step but grad is normal

pytorcher · May 20, 2025, 3:46am

I have a training script where weights consistently turn into nan, but I don’t see how it’s happening. I set up the following block to debug.

temp = convLayer.weight.detach().clone()
optimizer.step()
if convLayer.weight.isnan().any():
    import pdb; pdb.set_trace()

When the breakpoint triggers, I can print:

(Pdb)  convLayer.weight.grad[convLayer.weight.isnan()]
tensor([-3.3897e-10], device='cuda:0')
(Pdb)  convLayer.weight[convLayer.weight.isnan()]
tensor([nan], device='cuda:0', grad_fn=<IndexBackward0>)
(Pdb) temp[convLayer.weight.isnan()]
tensor([-0.0122], device='cuda:0')

It looks like the weight at this index is not nan before optimizer.step() and is nan after, yet the grad at that index is a normal-looking value. The following is the state of the optimizer.

(Pdb) optimizer
RMSprop (
Parameter Group 0
    alpha: 0.95
    capturable: False
    centered: True
    differentiable: False
    eps: 0.01
    foreach: None
    lr: 0.00025
    maximize: False
    momentum: 0
    weight_decay: 0
)
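
For reference, the centered variant of RMSprop as documented for PyTorch divides by sqrt(square_avg - grad_avg^2) + eps, so eps is added after the square root rather than inside it. The sketch below writes out a single update in plain tensor ops with the hyperparameters printed above. The function name and the running-average values are hand-picked for illustration (they are not taken from the real optimizer), chosen so that the difference under the square root comes out slightly negative. Whether this is what happens in the actual training run is an assumption that would need to be checked against the real optimizer state.

```python
import torch

def centered_rmsprop_update(param, grad, square_avg, grad_avg,
                            lr=0.00025, alpha=0.95, eps=0.01):
    """One centered RMSprop update with momentum=0, written out for illustration."""
    # Running average of the squared gradient.
    square_avg = alpha * square_avg + (1 - alpha) * grad * grad
    # Running average of the gradient (only kept when centered=True).
    grad_avg = alpha * grad_avg + (1 - alpha) * grad
    # Centered second moment; rounding can push this below zero.
    variance = square_avg - grad_avg * grad_avg
    # eps is added after the square root, so it cannot rescue a negative input.
    denom = variance.sqrt() + eps
    return param - lr * grad / denom, variance

# Weight and grad values as printed in the debugger session above.
param = torch.tensor([-0.0122])
grad = torch.tensor([-3.3897e-10])
# Hypothetical running averages, hand-picked so that variance < 0.
square_avg = torch.tensor([1.0e-6])
grad_avg = torch.tensor([1.1e-3])

new_param, variance = centered_rmsprop_update(param, grad, square_avg, grad_avg)
print(variance)   # negative
print(new_param)  # nan, even though param and grad are both finite
```

In exact arithmetic the running averages should keep square_avg at or above grad_avg^2, so a negative difference would only come from floating-point rounding when the two terms are nearly equal.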

pytorcher · May 21, 2025, 4:38pm

Bumping since there are no replies yet. Anyone have any thoughts?

Ori_Yarden · May 21, 2025, 8:57pm

Can you provide a replicable example? I’m not getting NaNs in my toy conv layer example.

pytorcher · May 21, 2025, 10:27pm

I updated my code to be the following:

    torch.save(optimizer, 'opt.pt')
    torch.save(optimizer.param_groups[0]['params'][8].grad, 'grad.pt')
    optimizer.step()
    if convLayer.weight.isnan().any():
        import pdb; pdb.set_trace()

I’ve uploaded the optimizer and grad saves here, and you can then run the following

import torch
temp = torch.load('opt.pt', weights_only=False)
grad = torch.load('grad.pt', weights_only=False)
temp.param_groups[0]['params'][8].grad = grad
print(temp.param_groups[0]['params'][8][62][83])
print(grad[62][83])
temp.step()
print(temp.param_groups[0]['params'][8][62][83])

to output

tensor(0.0085, device='cuda:0', grad_fn=<SelectBackward0>)
tensor(-6.5587e-22, device='cuda:0')
tensor(nan, device='cuda:0', grad_fn=<SelectBackward0>)
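
One way to narrow this down is to look at the running averages the optimizer keeps for that parameter. The sketch below assumes the per-parameter state keys `square_avg` and `grad_avg` that `torch.optim.RMSprop` stores when centered=True. It uses a toy layer in place of the uploaded files, and `centered_variance` is a hypothetical helper name.

```python
import torch
from torch import nn

def centered_variance(optimizer, param):
    """Return square_avg - grad_avg**2 from the optimizer state of one parameter."""
    state = optimizer.state[param]
    return state["square_avg"] - state["grad_avg"] ** 2

torch.manual_seed(0)
layer = nn.Linear(4, 3, bias=False)  # toy stand-in for the real conv layer
optimizer = torch.optim.RMSprop(
    layer.parameters(), lr=0.00025, alpha=0.95, eps=0.01, centered=True
)

# One backward pass and step so that the optimizer state is populated.
layer(torch.randn(8, 4)).pow(2).mean().backward()
optimizer.step()

variance = centered_variance(optimizer, layer.weight)
print(variance.min())        # smallest value that went under the square root
print((variance < 0).sum())  # number of entries that would produce nan
```

On the loaded optimizer, the same helper could be called after temp.step() with temp and temp.param_groups[0]['params'][8], then indexed at [62][83] to see whether that entry is negative.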

Ori_Yarden · May 22, 2025, 7:29pm

It seems that if you assign grad (the name of your variable, not the gradient itself) to temp’s .grad rather than grad’s .grad, the gradients get messed up when temp.step() is called.

So instead of:

temp = torch.load(path_to_pt_files + 'opt.pt', weights_only=False)
grad = torch.load(path_to_pt_files + 'grad.pt', weights_only=False)
print(temp.param_groups[0]['params'][8][62][83])
print(grad[62][83])
print('...')
temp.param_groups[0]['params'][8].grad = grad
print(temp.param_groups[0]['params'][8][62][83])
print(grad[62][83])
print('...')
temp.step()
print(temp.param_groups[0]['params'][8][62][83])
print(grad[62][83])

which outputs:

tensor(0.0085, device='cuda:0', grad_fn=<SelectBackward0>)
tensor(-6.5587e-22, device='cuda:0')
...
tensor(0.0085, device='cuda:0', grad_fn=<SelectBackward0>)
tensor(-6.5587e-22, device='cuda:0')
...
tensor(nan, device='cuda:0', grad_fn=<SelectBackward0>)
tensor(-6.5587e-22, device='cuda:0')

We need to assign .grad explicitly:

...
temp.param_groups[0]['params'][8].grad = grad[62][83].grad
...

which outputs:

tensor(0.0085, device='cuda:0', grad_fn=<SelectBackward0>)
tensor(-6.5587e-22, device='cuda:0')
...
tensor(0.0085, device='cuda:0', grad_fn=<SelectBackward0>)
tensor(-6.5587e-22, device='cuda:0')
...
tensor(0.0085, device='cuda:0', grad_fn=<SelectBackward0>)
tensor(-6.5587e-22, device='cuda:0')

pytorcher · May 23, 2025, 12:17pm

Thank you for the continued correspondence. But grad is the original grad tensor of temp.param_groups[0]['params'][8], so it will not have a gradient itself.

>>> grad[62][83].grad is None
True
>>> grad.grad is None
True
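
This likely also explains why the variant with grad[62][83].grad left the weight unchanged: it assigns None to the parameter’s .grad, and the optimizer skips parameters whose .grad is None. A minimal sketch with a toy layer (not the uploaded files) that shows the parameter is left untouched in that case:

```python
import torch
from torch import nn

torch.manual_seed(0)
layer = nn.Linear(4, 3, bias=False)  # toy stand-in for the real layer
optimizer = torch.optim.RMSprop(layer.parameters(), lr=0.00025, centered=True)

# Produce a real gradient, then keep a detached copy like the saved grad tensor.
layer(torch.randn(8, 4)).pow(2).mean().backward()
saved_grad = layer.weight.grad.detach().clone()
print(saved_grad.grad is None)  # a plain gradient tensor has no .grad of its own

before = layer.weight.detach().clone()
layer.weight.grad = saved_grad.grad  # this assigns None
optimizer.step()                     # nothing to apply for this parameter
print(torch.equal(before, layer.weight.detach()))
```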

The question is this: temp.param_groups[0]['params'][8] contains only normal numbers, and the grad tensor contains only normal numbers, so why does the optimizer step push a normal parameter value to nan with a normal grad? Only one other number in the tensor has the same problem:

>>> temp.param_groups[0]['params'][8].isnan().sum()
tensor(2, device='cuda:0')

Ori_Yarden · May 23, 2025, 8:45pm

My initial reply was wrong.

New reply:

When I load your optimizer ('opt.pt') it has .grad as None:

grad = torch.load(path_to_pt_files + 'grad.pt', weights_only=False)
opt = torch.load(path_to_pt_files + 'opt.pt', weights_only=False)

print('grad:')
print(grad)
print(grad.grad)

print('opt:')
print(opt.param_groups[0]['params'][8].grad)

grad:
tensor([[ 0.0000e+00,  0.0000e+00,  0.0000e+00,  ...,  0.0000e+00,
          0.0000e+00,  0.0000e+00],
        [ 2.6891e-04,  2.6891e-04,  2.6891e-04,  ...,  1.2958e-04,
          9.1939e-05,  1.3036e-04],
        [-3.6381e-06, -3.6381e-06, -3.6381e-06,  ..., -1.3383e-06,
         -1.3358e-06, -1.3876e-06],
        ...,
        [ 5.5009e-04,  5.5009e-04,  5.5009e-04,  ...,  2.4709e-04,
          2.2063e-04,  1.9834e-04],
        [-7.2875e-04, -7.2875e-04, -7.2875e-04,  ..., -3.2083e-04,
         -2.5821e-04, -3.2125e-04],
        [-3.8400e-07, -3.8400e-07, -3.8400e-07,  ..., -1.7627e-07,
         -1.3704e-07, -1.5496e-07]], device='cuda:0')
None
opt:
None

It could be because you’re saving the optimizer object rather than its .state_dict(). And if you try to assign new gradients to the already saved optimizer like this:

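import torch
from torch import nn

path_to_pt_files = './'  # placeholder: directory where the .pt files are saved

model = nn.Sequential(
    nn.Linear(6, 2, bias=False),
    nn.Sigmoid(),
)
inputs = torch.randn(6)
target = torch.randn(2)
criterion = nn.MSELoss()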

optimizer = torch.optim.RMSprop(model.parameters(), lr=0.01)
torch.save(optimizer.state_dict(), path_to_pt_files + 'example_optim.pt')
optimizer_state_dict = torch.load(path_to_pt_files + 'example_optim.pt', weights_only=False)
optimizer = torch.optim.RMSprop(model.parameters(), lr=0.01)
optimizer.load_state_dict(optimizer_state_dict)
print('saved optimizer.grad: ', optimizer.param_groups[0]['params'])

optimizer.zero_grad()
torch.save(optimizer.state_dict(), path_to_pt_files + 'example_optim.pt')
optimizer_state_dict = torch.load(path_to_pt_files + 'example_optim.pt', weights_only=False)
optimizer = torch.optim.RMSprop(model.parameters(), lr=0.01)
optimizer.load_state_dict(optimizer_state_dict)
print('saved after zeroing optimizer.grad: ', optimizer.param_groups[0]['params'])

optimizer.zero_grad()
output = model(inputs)
loss = criterion(output, target)
loss.backward()

torch.save(optimizer.state_dict(), path_to_pt_files + 'example_optim.pt')
optimizer_state_dict = torch.load(path_to_pt_files + 'example_optim.pt', weights_only=False)
optimizer = torch.optim.RMSprop(model.parameters(), lr=0.01)
optimizer.load_state_dict(optimizer_state_dict)
print('saved after predicting optimizer.grad: ', optimizer.param_groups[0]['params'])
optimizer.step()
torch.save(optimizer.state_dict(), path_to_pt_files + 'example_optim.pt')
optimizer_state_dict = torch.load(path_to_pt_files + 'example_optim.pt', weights_only=False)
optimizer = torch.optim.RMSprop(model.parameters(), lr=0.01)
optimizer.load_state_dict(optimizer_state_dict)
print('saved after stepping optimizer.grad: ', optimizer.param_groups[0]['params'])

new_gradient = torch.rand(optimizer.param_groups[0]['params'][0].grad.shape, requires_grad=True)
print(f'new_gradient: {new_gradient}')
optimizer.param_groups[0]['params'][0] = torch.nn.parameter.Parameter(new_gradient, requires_grad=True)
print('new values for optimizer.grad: ', optimizer.param_groups[0]['params'])

torch.save(optimizer.state_dict(), path_to_pt_files + 'example_optim.pt')
optimizer_state_dict = torch.load(path_to_pt_files + 'example_optim.pt', weights_only=False)
optimizer = torch.optim.RMSprop(model.parameters(), lr=0.01)
optimizer.load_state_dict(optimizer_state_dict)
print('new values after saving again for optimizer.grad: ', optimizer.param_groups[0]['params'])

it will cause a KeyError when saving out the optimizer again:

saved optimizer.grad:  [Parameter containing:
tensor([[ 3.9279e-01,  2.6180e-01,  3.1024e-01, -3.6962e-02, -1.4670e-01,
         -2.1468e-01],
        [-1.2634e-04, -2.9840e-01,  9.1751e-02, -1.2751e-01,  1.7776e-01,
         -2.1791e-01]], requires_grad=True)]
saved after zeroing optimizer.grad:  [Parameter containing:
tensor([[ 3.9279e-01,  2.6180e-01,  3.1024e-01, -3.6962e-02, -1.4670e-01,
         -2.1468e-01],
        [-1.2634e-04, -2.9840e-01,  9.1751e-02, -1.2751e-01,  1.7776e-01,
         -2.1791e-01]], requires_grad=True)]
saved after predicting optimizer.grad:  [Parameter containing:
tensor([[ 3.9279e-01,  2.6180e-01,  3.1024e-01, -3.6962e-02, -1.4670e-01,
         -2.1468e-01],
        [-1.2634e-04, -2.9840e-01,  9.1751e-02, -1.2751e-01,  1.7776e-01,
         -2.1791e-01]], requires_grad=True)]
saved after stepping optimizer.grad:  [Parameter containing:
tensor([[ 0.2928,  0.1618,  0.2102,  0.0630, -0.2467, -0.3147],
        [-0.1001, -0.3984, -0.0082, -0.0275,  0.0778, -0.3179]],
       requires_grad=True)]
new_gradient: tensor([[0.8659, 0.3938, 0.4791, 0.8014, 0.1843, 0.5315],
        [0.7555, 0.2915, 0.1225, 0.6326, 0.5627, 0.9159]], requires_grad=True)
new values for optimizer.grad:  [Parameter containing:
tensor([[0.8659, 0.3938, 0.4791, 0.8014, 0.1843, 0.5315],
        [0.7555, 0.2915, 0.1225, 0.6326, 0.5627, 0.9159]], requires_grad=True)]
Traceback (most recent call last):
...
...
...
(param_mappings[id(k)] if isinstance(k, torch.Tensor) else k): v
KeyError: 139737484299376

I’m not sure why that happens at the moment, so I’ll have to come back to this. Or someone smarter than me could give us some direction.
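
A possible explanation, offered as an assumption rather than a confirmed diagnosis: the traceback line looks up each key of optimizer.state in a mapping built from the tensors currently listed in param_groups. After the first step, the state is keyed by the original parameter, so replacing the list entry with a new Parameter leaves that key without a match. A minimal sketch of an alternative that keeps the parameter in place and assigns the new values to its .grad instead:

```python
import torch
from torch import nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(6, 2, bias=False), nn.Sigmoid())
optimizer = torch.optim.RMSprop(model.parameters(), lr=0.01)

# One regular step so that the optimizer holds state for the parameter.
loss = nn.MSELoss()(model(torch.randn(6)), torch.randn(2))
loss.backward()
optimizer.step()

param = optimizer.param_groups[0]["params"][0]
new_gradient = torch.rand(param.shape)

# Assign the gradient to the existing parameter instead of replacing it.
param.grad = new_gradient.clone()

# The parameter object is unchanged, so the state lookup still succeeds.
state_dict = optimizer.state_dict()
print(state_dict["param_groups"][0]["params"])

before = param.detach().clone()
optimizer.step()  # applies new_gradient to the original parameter
print(torch.equal(before, param.detach()))
```

Note that optimizer.state_dict() holds the hyperparameters and running averages but not the parameter values or their gradients, so a gradient has to be saved and restored separately, as in the grad.pt file above.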