Updating Parameters without using optimizer.step()

In general it is better to update in place, because other components that hold a reference to your parameters will still have valid references. If you replace the Tensor instead, you need to make sure to update every other place that held a reference to it. You also need to make sure the parameter remains a leaf, so that the following backward pass populates its .grad field.
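Something like this (a minimal sketch with a placeholder model and learning rate):

import torch
import torch.nn as nn

model = nn.Linear(2, 1)   # placeholder model
lr = 0.1                  # placeholder learning rate

loss = model(torch.randn(4, 2)).sum()
loss.backward()

with torch.no_grad():
    for p in model.parameters():
        # in-place update: p stays the same Tensor object and remains a leaf,
        # so references to it stay valid and the next backward() still
        # populates p.grad
        p -= lr * p.grad
        p.grad = None  # clear the gradient for the next iteration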


Hey @albanD,

Does this method also work in DistributedDataParallel training?
I am modifying the model weights to apply an orthogonality constraint to them, and it works fine on a single GPU.
But when I use DDP, the weights become unstable and cause an assertion error in my code!

Hi,

DDP does a lot of extra work to sync the weights across machines, so you definitely want to use in-place updates in that case.
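Something along these lines (a rough sketch; it assumes the process group and the DDP wrapper are already set up, and the constraint itself is just a placeholder, not your actual orthogonality step):

import torch

# assumes the model is already wrapped, e.g.
# ddp_model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[rank])

def apply_constraint(ddp_model):
    # modify the wrapped module's parameters in place so that the Tensor
    # objects DDP holds references to (for gradient synchronization) stay valid
    with torch.no_grad():
        for p in ddp_model.module.parameters():
            p.copy_(p / (p.norm() + 1e-8))  # placeholder constraint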


Hello @albanD. I want to take the gradient of Network2 with respect to Loss1 (this gradient is also a function of Network1), use it to modify Network2, and then use the modified Network2 to compute a second loss, Loss2. I want to achieve a backward chain: Loss2 → grad of Loss1 → Network1. Here is my code:


import torch
from torch.optim import Adam

# DNN is my own small network class (2 inputs -> 1 output), not shown here
DNN1 = DNN(2, 1)
DNN2 = DNN(2, 1)
x = torch.tensor([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]])
opt1 = Adam(DNN1.parameters(), lr=3e-4, betas=(0.9, 0.999))
opt2 = Adam(DNN2.parameters(), lr=3e-4, betas=(0.9, 0.999))

loss1 = (DNN1(x) + DNN2(x)).sum() * (DNN1(x) + DNN2(x)).sum()
opt2.zero_grad()
opt1.zero_grad()

# gradient of loss1 w.r.t. Network2's parameters, kept differentiable
gradi = torch.autograd.grad(loss1, DNN2.parameters(), retain_graph=True, create_graph=True)

# write the modified weights into Network2 in place
with torch.no_grad():
    for p, g in zip(DNN2.parameters(), gradi):
        new_val = p + g
        p.copy_(new_val)

loss2 = DNN2(x).sum()
opt1.zero_grad()
loss2.backward(retain_graph=True, create_graph=True)
opt1.step()

But the parameters of the modified Network2 have no grad_fn when the copy happens inside “with torch.no_grad()”, and removing the no_grad() block causes “RuntimeError: a leaf Variable that requires grad is being used in an in-place operation.”.
What should I do if I want to have this backward chain?
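One direction I am considering (assuming torch.func.functional_call, available since PyTorch 2.0) is to keep the updated weights as non-leaf tensors and evaluate Network2 functionally instead of copying them into its parameters:

from torch.func import functional_call

gradi = torch.autograd.grad(loss1, DNN2.parameters(), retain_graph=True, create_graph=True)

# build the updated weights as non-leaf tensors that stay connected to the graph
new_params = {name: p + g for (name, p), g in zip(DNN2.named_parameters(), gradi)}

# evaluate DNN2 with the updated weights without touching its real parameters
loss2 = functional_call(DNN2, new_params, (x,)).sum()

opt1.zero_grad()
loss2.backward()   # gradients flow back through gradi into DNN1
opt1.step()

Would this be the right way to get the chain Loss2 → grad of Loss1 → Network1?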
I’m looking forward to your reply!