Optimizer not updating the weights/parameters

I am using Adam together with LBFGS. The loss doesn’t change with each epoch when I try to use optimizer.step() with the closure function. If I use only Adam with optimizer.step(), the loss converges (albeit slowly, which is why I decided to try LBFGS). Can you tell me where my code is wrong?

optimizer1 = torch.optim.Adam(net.parameters(),lr = 0.0001)
optimizer2 = torch.optim.LBFGS(net.parameters(),lr=0.001)
## Training
iterations = 10
loss_array = np.zeros((iterations))

for epoch in range(iterations):
  def closure():
    optimizer1.zero_grad() # to make the gradients zero
    optimizer2.zero_grad() # to make the gradients zero

    
    # # Data driven/boundary loss
    # pt_x_bc1 = Variable(torch.from_numpy(x_bc1).float(), requires_grad=False).to(device)
    # pt_x_bc2 = Variable(torch.from_numpy(x_bc2).float(), requires_grad=False).to(device)
    # pt_u_bc = Variable(torch.from_numpy(u_bc).float(), requires_grad=False).to(device)
    # net_bc_out1 = net(pt_x_bc1) 
    # net_bc_out2 = net(pt_x_bc2) 
    # mse_u1 = mse_cost_function(net_bc_out1, pt_u_bc)
    # mse_u2 = mse_cost_function(net_bc_out2, pt_u_bc)
    # mse_u = mse_u1 + mse_u2

    ## Physics informed loss
    all_zeros = np.zeros((500,1))
    pt_x_collocation = Variable(torch.from_numpy(x_collocation).float(), requires_grad=True).to(device)
    pt_all_zeros = Variable(torch.from_numpy(all_zeros).float(), requires_grad=False).to(device)
    f_out = f(pt_x_collocation, net) # output of f(x,t)
    mse_pinn = mse_cost_function(f_out, pt_all_zeros)
  
    ## Training data loss
    u_train = net(pt_x_collocation)
    pt_u_true = Variable(torch.from_numpy(u_true).float(), requires_grad=False).to(device)
    mse_training = mse_cost_function(u_train, pt_u_true)

    # Combining the loss functions
    loss = mse_pinn + mse_training
    loss_array[epoch] = loss
    loss.backward() 
    return loss

  if epoch<5000:
    optimizer1.step(closure)
  else: 
    optimizer2.step(closure)

  with torch.autograd.no_grad():
    print(epoch, "Training Loss:", loss.data)
    

This is the output:

0 Training Loss: tensor(0.4883)
1 Training Loss: tensor(0.4883)
2 Training Loss: tensor(0.4883)
3 Training Loss: tensor(0.4883)
4 Training Loss: tensor(0.4883)
5 Training Loss: tensor(0.4883)
6 Training Loss: tensor(0.4883)
7 Training Loss: tensor(0.4883)
8 Training Loss: tensor(0.4883)
9 Training Loss: tensor(0.4883)

Thanks

You are detaching the output from the computation graph by rewrapping it into a deprecated Variable here:

pt_u_true = Variable(torch.from_numpy(u_true).float(), requires_grad=False).to(device)

However, based on this code snippet, it seems you are also using numpy arrays, which won’t be attached to the computation graph by Autograd in the first place.
If you need to use other libraries such as numpy, you would need to write a custom autograd.Function and implement the backward method manually.
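As a minimal, self-contained sketch of that pattern (a hypothetical NumpySin, not part of the code above, which computes sin in numpy on the forward pass and supplies the matching gradient by hand):

```python
import numpy as np
import torch

class NumpySin(torch.autograd.Function):
    """Route a numpy computation through Autograd by implementing
    forward and backward manually."""

    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        # Leave the graph, compute in numpy, and return a new tensor
        result = np.sin(x.detach().cpu().numpy())
        return torch.from_numpy(result).to(x.device)

    @staticmethod
    def backward(ctx, grad_output):
        x, = ctx.saved_tensors
        # d/dx sin(x) = cos(x); apply the chain rule with the incoming gradient
        return grad_output * torch.cos(x)

x = torch.randn(5, requires_grad=True)
y = NumpySin.apply(x).sum()
y.backward()
print(x.grad)  # matches cos(x), as supplied by the manual backward
```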

That doesn’t seem right, as no training should happen.

The training does take place when I use Adam and just call optimizer.step(). Here’s the code:

## Training
iterations = 10
optimizer1 = torch.optim.Adam(net.parameters(),lr = 0.01)
loss_array = np.zeros((iterations))

for epoch in range(iterations):
  optimizer1.zero_grad() # to make the gradients zero

  ## Physics informed loss
  all_zeros = np.zeros((500,1))
  pt_x_collocation = Variable(torch.from_numpy(x_collocation).float(), requires_grad=True).to(device)
  pt_all_zeros = Variable(torch.from_numpy(all_zeros).float(), requires_grad=False).to(device)
  f_out = f(pt_x_collocation, net) # output of f(x,t)
  mse_pinn = mse_cost_function(f_out, pt_all_zeros)

  ## Training data loss
  u_train = net(pt_x_collocation)
  pt_u_true = Variable(torch.from_numpy(u_true).float(), requires_grad=False).to(device)
  mse_training = mse_cost_function(u_train, pt_u_true)

  # Combining the loss functions
  loss = mse_pinn + mse_training
  loss_array[epoch] = loss
  loss.backward() 

  optimizer1.step()
    
  with torch.autograd.no_grad():
    print(epoch, "Training Loss:", loss.data)
    

The output for just 10 epochs is:
0 Training Loss: tensor(0.7643)
1 Training Loss: tensor(0.7007)
2 Training Loss: tensor(0.6511)
3 Training Loss: tensor(0.6144)
4 Training Loss: tensor(0.5886)
5 Training Loss: tensor(0.5705)
6 Training Loss: tensor(0.5555)
7 Training Loss: tensor(0.5401)
8 Training Loss: tensor(0.5223)
9 Training Loss: tensor(0.5022)

Could you check the gradients of the parameters after the backward call?
I would assume they are all zero, since optimizer.zero_grad() resets them and a detached graph wouldn’t repopulate them during the backward pass.
Based on my previous points, i.e. a) detaching the graph explicitly and b) using numpy, I don’t see where a computation graph should be coming from in your code snippet.
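A minimal, self-contained sketch of such a check (with a toy stand-in for `net`, since the full model isn’t shown here) could look like this:

```python
import torch

# Toy stand-in for `net` (assumption: the real model is some small MLP)
net = torch.nn.Sequential(
    torch.nn.Linear(1, 10),
    torch.nn.Tanh(),
    torch.nn.Linear(10, 1),
)

x = torch.randn(500, 1)
loss = (net(x) ** 2).mean()  # any scalar loss attached to the graph
loss.backward()

# Inspect each parameter's gradient magnitude right after backward()
for name, param in net.named_parameters():
    grad_mag = None if param.grad is None else param.grad.abs().sum().item()
    print(name, grad_mag)
```

If every gradient prints as None (or stays zero), the loss was detached from the parameters somewhere upstream.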

Hi, I have printed the gradients of the first hidden layer to show that they are indeed being calculated when I use optimizer.step(). Here’s the code snippet, and below it the output for the first few epochs:

## Training
iterations = 10
optimizer1 = torch.optim.Adam(net.parameters(),lr = 0.01)
loss_array = np.zeros((iterations))

for epoch in range(iterations):
  optimizer1.zero_grad() # to make the gradients zero

  ## Physics informed loss
  all_zeros = np.zeros((500,1))
  pt_x_collocation = Variable(torch.from_numpy(x_collocation).float(), requires_grad=True).to(device)
  pt_all_zeros = Variable(torch.from_numpy(all_zeros).float(), requires_grad=False).to(device)
  f_out = f(pt_x_collocation, net) # output of f(x,t)
  mse_pinn = mse_cost_function(f_out, pt_all_zeros)

  ## Training data loss
  u_train = net(pt_x_collocation)
  pt_u_true = Variable(torch.from_numpy(u_true).float(), requires_grad=False).to(device)
  mse_training = mse_cost_function(u_train, pt_u_true)

  # Combining the loss functions
  loss = mse_pinn + mse_training
  loss_array[epoch] = loss
  loss.backward() 

  # for name, param in net.named_parameters():
  #   if param.requires_grad:
  #       print(name, param.grad)
  print('Hidden layer 1 weights gradient:',net.hidden_layer1.weight.grad)
  optimizer1.step()
    
  with torch.autograd.no_grad():
    print(epoch, "Training Loss:", loss.data)
    

The output is:
Hidden layer 1 weights gradient: tensor([[ 4.5100e-07],
[ 8.3037e-08],
[-3.1868e-09],
[-1.2420e-08],
[ 1.9485e-07],
[ 1.8525e-08],
[-3.7920e-07],
[-3.4814e-07],
[-1.6464e-07],
[ 1.1097e-07]])
0 Training Loss: tensor(0.1058)
Hidden layer 1 weights gradient: tensor([[-5.2541e-07],
[ 1.9976e-06],
[ 1.1666e-07],
[ 1.9102e-07],
[-7.6507e-07],
[-5.7566e-08],
[ 2.1010e-06],
[ 5.4820e-07],
[ 7.9201e-07],
[ 8.8915e-07]])
1 Training Loss: tensor(0.1038)
Hidden layer 1 weights gradient: tensor([[-4.4913e-08],
[ 4.0377e-07],
[ 5.9259e-08],
[ 6.0829e-08],
[-2.1468e-07],
[-1.2919e-08],
[ 3.1558e-07],
[ 7.9497e-08],
[ 2.5290e-07],
[ 3.2178e-07]])
2 Training Loss: tensor(0.0976)
Hidden layer 1 weights gradient: tensor([[ 4.2063e-07],
[-1.0188e-06],
[-4.8912e-08],
[-4.5130e-08],
[ 2.5593e-07],
[-5.5157e-09],
[-1.3609e-06],
[-3.6893e-07],
[-2.9356e-07],
[-2.5360e-07]])
3 Training Loss: tensor(0.0938)

Thanks for the update!
I missed this part of the code:

f_out = f(pt_x_collocation, net) # output of f(x,t)
mse_pinn = mse_cost_function(f_out, pt_all_zeros)
...
loss = mse_pinn + mse_training
loss.backward()

which is not detaching the computation graph and will thus create the gradients.
mse_training will still be detached and not influence the gradient calculation.

EDIT: I also misread the second part, and I don’t know why you are setting the requires_grad attribute to True. In any case, that’s not necessary if you don’t need to update the targets, and you should remove the deprecated Variable usage.
Once this is done, check the gradients again, verify that they are calculated, and compare the parameters before and after the step() operation; you should then see the update.
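A minimal sketch of that before/after comparison, with a hypothetical one-layer model standing in for the real net, could look like this:

```python
import torch

# Hypothetical minimal setup to verify that step() changes the parameters
net = torch.nn.Linear(1, 1)
opt = torch.optim.LBFGS(net.parameters(), lr=0.1)
x, y = torch.randn(8, 1), torch.randn(8, 1)

def closure():
    # LBFGS calls this repeatedly, so it must zero, recompute, and backward
    opt.zero_grad()
    loss = torch.nn.functional.mse_loss(net(x), y)
    loss.backward()
    return loss

before = net.weight.detach().clone()
opt.step(closure)
after = net.weight.detach().clone()
print("weight changed:", not torch.equal(before, after))
```

If the printed flag is False, the closure isn’t producing gradients and the detaching issue is still present.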