Not getting grad on a new parameter

I have a training flow where I need to add a new parameter to a module during training, as shown below.

import copy
import torch

for epoch in range(10):
    if epoch == 5:
        # register the new parameter and manually append a param group for it
        module.register_parameter('t', torch.nn.Parameter(torch.tensor([0.05])))
        opt.param_groups.append(copy.deepcopy(opt.param_groups[-1]))
        opt.param_groups[-1]['params'] = [module.t]

    loss = module()  # the forward pass uses the 't' parameter from epoch 5 onwards
    loss.backward()
    ...

But I don’t see the parameter ‘t’ being updated, and module.t.grad is None. Are there any other steps I need to take to add a parameter on the fly?

Could you post a minimal and executable code snippet to reproduce the issue, please?

Thanks @ptrblck, I cooked up a simple example. It seems the grad is there, but model.l1.t is not being updated?

Is this problem specific to Adadelta? With plain SGD it seems to work…

I’m not sure if your manual param_groups manipulation works and would recommend using add_param_group instead:

model.l1.register_parameter('t', torch.nn.Parameter(torch.Tensor([0.05]).to(device)))
opt.add_param_group({'params': [model.l1.t]})

Afterwards I get this output:

Train Epoch: 0 [576/60000 (1%)]	Loss: -0.093652
Train Epoch: 1 [576/60000 (1%)]	Loss: -0.130155
Adadelta (
Parameter Group 0
    eps: 1e-06
    initial_lr: 0.001
    lr: 0.00049
    rho: 0.9
    weight_decay: 0

Parameter Group 1
    eps: 1e-06
    lr: 0.001
    rho: 0.9
    weight_decay: 0
)
Parameter containing:
tensor([0.0500], device='cuda:0', requires_grad=True) *grad* None
Parameter containing:
tensor([0.0500], device='cuda:0', requires_grad=True) *grad* tensor([-0.1296], device='cuda:0')
Parameter containing:
tensor([0.0500], device='cuda:0', requires_grad=True) *grad* tensor([-0.1662], device='cuda:0')
Parameter containing:
tensor([0.0500], device='cuda:0', requires_grad=True) *grad* tensor([-0.0712], device='cuda:0')
Parameter containing:
tensor([0.0500], device='cuda:0', requires_grad=True) *grad* tensor([-0.1003], device='cuda:0')
Parameter containing:
tensor([0.0500], device='cuda:0', requires_grad=True) *grad* tensor([-0.0827], device='cuda:0')
Parameter containing:
tensor([0.0500], device='cuda:0', requires_grad=True) *grad* tensor([-0.0788], device='cuda:0')
Parameter containing:
tensor([0.0500], device='cuda:0', requires_grad=True) *grad* tensor([-0.1489], device='cuda:0')
Parameter containing:
tensor([0.0500], device='cuda:0', requires_grad=True) *grad* tensor([-0.1556], device='cuda:0')
Parameter containing:
tensor([0.0500], device='cuda:0', requires_grad=True) *grad* tensor([-0.1602], device='cuda:0')
Train Epoch: 2 [576/60000 (1%)]	Loss: -0.007945
Parameter containing:
tensor([0.0500], device='cuda:0', requires_grad=True) *grad* tensor([-0.1588], device='cuda:0')
Parameter containing:
tensor([0.0500], device='cuda:0', requires_grad=True) *grad* tensor([-0.0828], device='cuda:0')
Parameter containing:
tensor([0.0500], device='cuda:0', requires_grad=True) *grad* tensor([-0.1081], device='cuda:0')
Parameter containing:
tensor([0.0500], device='cuda:0', requires_grad=True) *grad* tensor([-0.1140], device='cuda:0')
Parameter containing:
tensor([0.0500], device='cuda:0', requires_grad=True) *grad* tensor([-0.1470], device='cuda:0')
Parameter containing:
tensor([0.0500], device='cuda:0', requires_grad=True) *grad* tensor([-0.1953], device='cuda:0')
Parameter containing:
tensor([0.0500], device='cuda:0', requires_grad=True) *grad* tensor([-0.1221], device='cuda:0')
Parameter containing:
tensor([0.0500], device='cuda:0', requires_grad=True) *grad* tensor([-0.1119], device='cuda:0')
Parameter containing:
tensor([0.0500], device='cuda:0', requires_grad=True) *grad* tensor([-0.0748], device='cuda:0')
Parameter containing:
tensor([0.0501], device='cuda:0', requires_grad=True) *grad* tensor([-0.1101], device='cuda:0')
Train Epoch: 3 [576/60000 (1%)]	Loss: -0.008487

which shows that the new parameter is updated.
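
If you want to see the step itself rather than just the changing grads, a minimal check (a sketch, assuming the same model and opt names as above, and that gradients have already been computed via loss.backward()) would be:

# compare the new parameter around a single optimizer step
before = model.l1.t.detach().clone()
opt.step()
print('step delta:', (model.l1.t.detach() - before).item())  # small but non-zero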

Apparently, both add_param_group and the way I added the new params work fine. The difference is that add_param_group starts the new group from the optimizer defaults (e.g. lr 0.001 in Parameter Group 1 above) rather than from the current values of the last group (lr 0.00049 after the scheduler).
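
If you want the new group to start from the last group's current settings rather than the optimizer defaults, one option (a sketch, assuming an Adadelta-style group as in the printout above) is to pass the hyperparameters explicitly to add_param_group:

# copy the current hyperparameters of the last group into the new one
last = opt.param_groups[-1]
opt.add_param_group({
    'params': [model.l1.t],
    'lr': last['lr'],          # e.g. 0.00049 after the scheduler, not the default 0.001
    'rho': last['rho'],
    'eps': last['eps'],
    'weight_decay': last['weight_decay'],
})

Note that an lr scheduler created before the new group is added won't adjust the new group automatically, since its base lrs are captured at construction time.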

Adding new params to adaptive optimizers doesn't seem to work well out of the box; perhaps these optimizers only take meaningful steps for parameters with properly accumulated curvature information, and newly added ones don't have any accumulated running averages yet.
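
My rough reasoning, as a sketch of the standard Adadelta update rule (not the exact PyTorch source): both of its per-parameter running averages start at zero for a freshly added parameter, which makes the first steps tiny.

# per-parameter state, initialized to zero for a newly added parameter
square_avg = rho * square_avg + (1 - rho) * grad**2
delta      = grad * ((acc_delta + eps) ** 0.5) / ((square_avg + eps) ** 0.5)
param      = param - lr * delta
acc_delta  = rho * acc_delta + (1 - rho) * delta**2

# with acc_delta == 0, the first steps are scaled by roughly sqrt(eps) ~ 1e-3,
# so with lr = 0.001 the parameter moves by only a few 1e-6 per step at first,
# consistent with the slow 0.0500 -> 0.0501 drift in the log above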