A Variable with requires_grad=True stored in a dictionary is not being updated

Hi, I am still very new to PyTorch, and after searching Google for this issue I could not find anything. Hopefully someone can help me.

I have a forward() method that looks as follows:

def forward(self, feature_vector):
    activation_function = {'tanh': torch.tanh, 'relu': F.relu}

    symbol, X = feature_vector

    X = self.backend.from_numpy(X)

    for i, l in enumerate(self.linears[symbol]):
        if i != self.out_layer_indices[symbol]:
            X = activation_function[self.activation_function](l(X))
        else:
            X = l(X)

    X = (self.slope[symbol] * X) + self.intercept[symbol]
    return X

The problem I have is with the line before the return statement. I also have a method called train where I define self.slope and self.intercept using a Python dictionary:

self.slope = {}
self.intercept = {}

for symbol in unique_element_symbols:
    linears = []

    intercept = ...  # some operation
    intercept = self.backend.from_numpy(intercept)
    self.intercept[symbol] = Variable(intercept, requires_grad=True)

    slope = ...  # some operation
    slope = self.backend.from_numpy(slope)
    self.slope[symbol] = Variable(slope, requires_grad=True)
    # I create a neural network here with Linear() + nn.ModuleDict

# Model is trained in this loop
for epoch in range(self.epochs):
    self.forward(tensor)
    # some more code to get the outputs
    criterion = nn.MSELoss()
    loss = torch.sqrt(criterion(outputs, targets))
    self.optimizer.zero_grad()  # clear previous gradients
    loss.backward()
    self.optimizer.step()

If my understanding of the documentation is correct, creating a Variable with requires_grad=True should make autograd aware of those tensors. I can see that the weights of the layers are being updated, but the variables I created inside the dictionary are not. Since they are used in the forward() method, they affect the output and should therefore change according to the gradient. Are those variables not updated because they are stored in a Python dictionary?

I would be glad if someone could help me understand what the issue with my code is.

Most likely the parameters are not properly registered and are thus unknown to the optimizer.

You could try to use an nn.ModuleDict instead, but you would need to wrap your parameters in an nn.Module, since plain nn.Parameters won’t be recognized there as far as I know.

PS: Variables are deprecated since 0.4.0. Just use torch.tensor(..., requires_grad=True) instead.

The core reason why those variables are not updated is that they are not registered with the optimizer. If you look at any optimizer’s constructor, there is a parameter named params, which is expected to hold the parameters you want to be updated automatically from their gradients.

I’m not sure how you initialized your optimizer. If you create the optimizer in the most common way, you should make whatever you want updated automatically a parameter of the model by using self.register_parameter in the model’s constructor. Then you can call torch.optim.SGD(model.parameters()), since model.parameters() will now also return the variables you just registered.
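
For illustration, a minimal sketch of that approach (the layer shape and the names slope / intercept here are made up, not taken from the code above):

import torch
import torch.nn as nn

class ScaledNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(8, 1)
        # registered parameters show up in model.parameters()
        self.register_parameter('slope', nn.Parameter(torch.ones(1)))
        self.register_parameter('intercept', nn.Parameter(torch.zeros(1)))

    def forward(self, x):
        return self.slope * self.linear(x) + self.intercept

model = ScaledNet()
# slope and intercept are now included in model.parameters()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)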

If you have a hard time putting slope or intercept into the model’s parameters, you could also register those tensors directly with the optimizer by using self.optimizer.add_param_group.
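
Roughly like this (a sketch; the tensors below just stand in for your slope and intercept):

import torch

model = torch.nn.Linear(8, 1)
slope = torch.ones(1, requires_grad=True)
intercept = torch.zeros(1, requires_grad=True)

optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
# add the extra tensors to the optimizer so optimizer.step() updates them too
optimizer.add_param_group({'params': [slope, intercept]})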


Thank you @ptrblck and @chenglu. Both of you were right. The parameters were not correctly registered when using a Python dictionary.

I tried using nn.ModuleDict and nn.Parameter without success. However, self.register_parameter() did the trick for me.

I removed the dictionaries for those parameters and added this instead:

self.register_parameter(intercept_name, intercept)
self.register_parameter(slope_name, slope)

To access them in the forward() method I had to do:

for name, param in self.named_parameters():
    if intercept_name == name:
        intercept = param 
    elif slope_name == name:
        slope = param 
       
X = slope * X + intercept 

Now intercept and slope are changing according to the gradient. Is there any way to access these parameters without using the loop over name, param shown above?

I’m glad it’s working!
You can just use the name you’ve used to register the parameters, e.g.:

X = self.slope_name * X + self.intercept_name
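
If the registered name is only available as a string at runtime (for example built per element symbol), getattr should also work; a small fragment for inside forward(), with hypothetical names:

# assuming parameters were registered under names like 'slope_Cu' / 'intercept_Cu'
slope = getattr(self, 'slope_' + symbol)
intercept = getattr(self, 'intercept_' + symbol)
X = slope * X + intercept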

Thanks. That worked.

This is me again… Something weird is happening. When I register the variables as we discussed above in this class (I pasted it in a Gist because it is long), not all of them are being seen by autograd. There should be a total of 8 tensors, but at the end of the optimization only 6 of them are shown (the relevant part of the output is below):

outputs
tensor([-14.5772266388, -14.5772266388])
targets
tensor([-14.5868730545, -14.5640010834])
No diff in intercept_Cu
No diff in slope_Cu
Diff in linears.Cu.0.weight
No diff in linears.Cu.0.bias
Diff in linears.Cu.2.weight
No diff in linears.Cu.2.bias
Diff in linears.Cu.4.weight
No diff in linears.Cu.4.bias

Optimized parameters for Cu symbol
Index 0
Parameter containing:
tensor([[ 1.5998219169e-05, -1.1084647089e-11,  3.4887983702e-07,
         -5.9022102505e-05,  1.5358797100e-05, -2.2421262713e-07,
          3.9578364522e-05, -2.5841705792e-05],
        [-1.2979450847e-11,  2.4929600051e-11, -4.3761640145e-11,
          8.7308825414e-07,  6.8003464548e-07, -6.9001464453e-07,
         -3.0529092328e-05, -1.9285680537e-06],
        [-3.4112128677e-09, -2.0672181415e-12,  1.0248225879e-12,
          1.3090937500e-05, -1.9991681199e-08, -1.2244654499e-05,
          1.1959917501e-09, -1.1793726173e-07],
        [ 9.3959987360e-12, -2.8132822081e-06, -7.1578106144e-06,
         -1.5608311514e-06,  7.4273208156e-05, -6.5615221589e-13,
          1.0243820725e-04,  2.6734230119e-07],
        [-2.8905316867e-05,  1.7972409978e-06,  2.8471620681e-05,
          1.1441625247e-06, -4.3263348743e-06,  9.2861837402e-06,
         -7.3636897469e-08, -6.2427188823e-06],
        [ 1.8716022510e-08, -4.3462468966e-06, -7.1537678559e-11,
          4.4766447493e-13, -4.2634189867e-07,  6.2688843006e-10,
         -1.5413985643e-09, -1.9352362415e-06],
        [-4.0789027480e-06,  1.7624552484e-08, -5.8772937336e-05,
          1.3928577259e-12,  1.4477242303e-06, -6.5660731252e-07,
          1.3057894830e-04,  1.0623334674e-06],
        [ 2.8627397342e-07,  7.6879496191e-07, -1.5201392500e-07,
          9.4639290182e-08,  1.7211885250e-09, -3.1544458712e-10,
         -3.1436915742e-04, -9.5523216004e-09],
        [ 5.4327131238e-07,  5.3367260989e-05,  3.0272097329e-11,
         -2.5873794129e-06, -2.5613280741e-07,  4.1264866013e-05,
          1.3438809527e-12, -5.6481166411e-09],
        [-6.4899657445e-05, -4.3667625960e-08, -6.4955729684e-10,
          7.9043999790e-08, -7.7281238191e-06,  1.7655082047e-05,
         -1.6245309098e-07, -1.7478591019e-08]], requires_grad=True)
Gradient tensor(0.0126342149)

Index 1
Parameter containing:
tensor([ 0.0846629143,  0.2052433789,  0.1129320264,  0.1384415329,
         0.2349925339, -0.1073408127,  0.2195934355,  0.3364700377,
         0.1929847300, -0.0893238783], requires_grad=True)
No gradient?

Index 2
Parameter containing:
tensor([[-1.4006408492e-05, -1.3260194009e-06,  1.4346720434e-07,
         -6.5448512032e-07,  2.9784255275e-06,  4.5995878547e-13,
          6.7223256337e-05,  6.4453017576e-12,  1.0301571401e-10,
         -1.2009696349e-08],
        [-2.2814828071e-07, -5.8791869151e-08, -3.9165245835e-04,
         -2.5221936539e-06,  1.1180619595e-06, -2.6514657293e-05,
         -1.4766897038e-07,  2.7023989242e-04, -2.9795790401e-12,
          3.4368467823e-06],
        [ 3.6120570712e-06, -3.7223298568e-04,  7.1171717408e-09,
         -4.0368172449e-06, -1.1812019807e-07, -9.0479334176e-06,
         -9.7775303479e-12,  3.3027842505e-07, -2.2225761143e-07,
          1.7060537516e-07],
        [ 4.7848516260e-05,  1.4109857602e-06, -4.7986867813e-09,
         -1.1886934145e-11, -1.5743089534e-06, -1.9210867777e-06,
          2.5946489401e-10,  7.1065740485e-05, -7.2540847214e-06,
         -2.9720404740e-13],
        [ 7.8338234744e-07,  2.9897366403e-05,  1.0493286936e-05,
         -1.2905216806e-07, -5.0532015905e-08, -1.4369081327e-05,
          5.9140187659e-05,  1.8394788640e-05,  2.8736901004e-04,
         -7.9514339557e-11],
        [-3.5491411109e-04,  3.9472433855e-06, -3.6779524635e-06,
          1.3279050108e-05,  1.0775630388e-09,  2.0076269536e-09,
          2.2207383154e-05,  1.0671607924e-05,  3.5179223801e-07,
          8.3256582002e-06],
        [-4.0831773518e-09,  3.4044984204e-05,  3.9824635678e-07,
         -5.4254252291e-07, -8.2707781596e-12,  7.9960360555e-10,
          1.6246242751e-07, -1.5748057303e-09, -4.6191617002e-05,
          1.4769234986e-04],
        [ 6.0335892158e-06,  4.0175755203e-06,  2.3420781872e-05,
         -1.4100555745e-07,  4.3824256863e-06, -1.9676244847e-05,
         -4.2883926653e-05,  2.6943742341e-05,  1.5044579982e-07,
          3.4529236359e-08],
        [-2.4134715204e-05,  3.6303499655e-05, -1.0801615247e-07,
          8.3609793364e-06,  3.0849619179e-06, -8.6793288574e-06,
          2.4900288554e-04,  8.5335452355e-14, -3.4220584699e-11,
         -4.0262357288e-06],
        [-3.2995540096e-06, -9.5245795251e-08,  2.4340472748e-08,
         -3.7661432133e-13, -4.4606429661e-09, -7.5562275015e-06,
         -6.9999718107e-05,  1.4586039470e-04,  1.0552175809e-06,
         -6.1385714220e-12]], requires_grad=True)
Gradient tensor(-0.0002918996)

Index 3
Parameter containing:
tensor([-0.0368886292, -0.1048975587, -0.2438423038, -0.2089971900,
         0.2615807354,  0.0241439044, -0.1016014665,  0.2302859128,
        -0.2738550305, -0.2952967882], requires_grad=True)
No gradient?

Index 4
Parameter containing:
tensor([[-1.6471599520e-04,  5.0920876674e-05,  1.6964193492e-05,
         -7.2204138633e-06, -7.4410144713e-11,  1.3845928848e-09,
          2.6772568162e-07,  4.4445322422e-11,  3.0647162930e-05,
         -4.6163746447e-05]], requires_grad=True)
Gradient tensor(-0.0134047084)

Index 5
Parameter containing:
tensor([-0.1564691514], requires_grad=True)
No gradient?

From the output it looks like only the weights of the layers get gradients, while the biases do not. Additionally, the loss seems to decrease with each epoch, but the outputs of the model remain the same.

Do you see any problem in the class I have built? What would you recommend checking? I would really appreciate any suggestions. I am lost here.

Could you, just for the sake of debugging, set the learning rate quite high, e.g. 100, and run a single update step to check if the parameters get updated?
I would like to make sure that the gradients are not simply too small, so that we are not missing the updates in the parameters even though the code should generally work.
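
Just as an illustration, such a check could look roughly like this (a sketch that assumes model, optimizer, and loss are already defined):

import torch

# snapshot all parameters, run one update step, then compare
before = {name: p.detach().clone() for name, p in model.named_parameters()}
optimizer.zero_grad()
loss.backward()
optimizer.step()
for name, p in model.named_parameters():
    print(name, 'updated' if not torch.equal(p, before[name]) else 'unchanged')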

If that still doesn’t work, could you point me to some lines of code in your Gist so that debugging would be a bit faster?

I tried it and this is the output:

outputs
tensor([-14.7144765854, -14.7144765854])
targets
tensor([-14.5868730545, -14.5640010834])
No diff in intercept_Cu
No diff in slope_Cu
Diff in linears.Cu.0.weight
No diff in linears.Cu.0.bias
Diff in linears.Cu.2.weight
No diff in linears.Cu.2.bias
Diff in linears.Cu.4.weight
No diff in linears.Cu.4.bias

Optimized parameters for Cu symbol
Index 0
Parameter containing:
tensor([[ 8.5844440460e+00,  5.7265067101e-01,  9.7642364502e+00,
         -6.3340994529e-05, -6.5706312656e-01, -1.0317638516e-01,
         -8.1990205217e-04,  3.8604885340e-02],
        [ 5.6350793839e+00, -2.3107304573e+00,  3.0389562016e-04,
          4.8599953651e+00,  6.9255195558e-03, -1.2411396503e+00,
         -1.9741505384e-01,  1.5696491755e-04],
        [-5.0464026572e-05, -3.3354687691e+00,  8.9420490265e+00,
          3.8780815601e+00,  2.3376888130e-03, -5.5896580219e-02,
          1.3885598630e-02, -1.0426228866e-02],
        [ 9.5443532337e-04,  6.6207236052e-01, -8.9323358536e+00,
          6.4504299164e+00, -5.6406411204e-06, -1.3022724390e+00,
         -6.7651176453e-01, -4.0060423315e-02],
        [-7.3771867752e+00, -2.8842912674e+01, -1.2261917114e+01,
          2.0418181084e-03, -3.8052463531e+00, -4.1274856776e-02,
          7.0336312056e-02,  7.7507920563e-02],
        [ 1.1060595512e-01, -3.2884517312e-01,  1.3592758179e+00,
         -3.1894344091e-01, -1.7296176404e-02, -6.2223523855e-02,
         -1.0991756916e+00, -1.0611775797e-03],
        [-1.0529378653e+00,  8.7992340326e-02, -3.9956837893e-02,
          5.1080572605e-01,  7.3425645828e+00, -1.4347046090e-05,
          4.9415111542e-02,  1.4767070770e+01],
        [-6.3522720337e-01, -3.3820127137e-03,  1.0707162857e+01,
          1.5198823530e-03, -5.2807319164e-01,  5.2644854784e-01,
         -1.2110622600e-02, -2.9190010391e-03],
        [ 3.0526014045e-02, -1.3536047190e-03, -3.8478989154e-04,
          2.9252339154e-03,  5.4483871460e+00,  7.9564154148e-03,
         -1.8055616617e+00, -6.4464583993e-03],
        [-4.6962329745e-01,  8.6185136752e-06,  2.4837136269e-02,
         -4.1209143092e-05,  4.2492513657e+00,  8.4312686920e+00,
          1.9236560433e-07,  3.2226529717e-01]], requires_grad=True)
Gradient tensor(-0.0052632340)

Index 1
Parameter containing:
tensor([ 0.3178096116,  0.0436611772, -0.2040621340, -0.0848887563,
         0.2899220884, -0.2525188029,  0.3507566750, -0.1945837736,
         0.1707542241, -0.0507352650], requires_grad=True)
No gradient?

Index 2
Parameter containing:
tensor([[ 9.4454865903e-03, -3.8685989380e-01,  2.8510479927e+00,
          1.6451107513e-05,  7.2450813605e-05, -2.4000716209e-01,
         -1.0067681968e-01, -6.8808451295e-02, -1.3941727579e-02,
         -1.1572503299e-01],
        [-4.0232582251e-05,  1.5237447619e-01, -8.4863287952e-08,
         -3.3062148094e-01,  3.1492298841e-01,  7.1657931805e-01,
          6.5576374531e-02,  5.8732334524e-02, -1.4156305790e-01,
         -1.1431868374e-01],
        [ 3.6709681153e-01, -5.6241098791e-03, -1.8890530029e-08,
          2.2205217101e-04, -3.8731803894e+00, -8.4317040443e-01,
         -3.5567022860e-03,  7.6645493507e-02, -1.7931096554e+00,
         -2.0117998123e+00],
        [-2.6692817919e-03,  2.3045387268e+00, -1.9369858503e-01,
          1.1653967202e-02, -1.5044789314e+00,  2.6386910677e-01,
         -1.8918566406e-02,  2.4579927325e-02, -8.7022192020e-05,
          1.4020656636e-07],
        [ 2.8100815415e-01, -1.4995394740e-03,  3.7862854004e+00,
          2.3118360519e+01, -1.2707098722e+00, -8.9394124225e-03,
         -7.3012824942e-06,  6.0733418650e-07, -6.8714976311e-02,
         -1.7940466932e-04],
        [-1.1861760616e+00, -1.7072351277e-01,  7.4709236622e-02,
         -1.6057054698e-01,  1.0028474033e-01,  4.4707970619e+00,
         -3.2747825980e-01,  1.8114055820e-06, -6.0276460648e-01,
         -2.9894538879e+01],
        [-2.0331549644e-01, -9.2998981476e-01, -2.3422073573e-03,
          6.5794992447e-01, -4.0772670507e-01, -1.7908929586e+00,
         -4.3703973293e-02, -2.3664340377e-02,  3.4835241735e-02,
          7.3881530762e-01],
        [ 5.4340696335e-01,  1.3241521083e-04, -3.2028186321e-01,
         -1.6411489248e-01, -8.0035102367e-01, -1.0085972399e-01,
         -2.3231016099e-01,  9.6048679352e+00, -1.3925330162e+01,
         -8.2148885727e-01],
        [-6.0046720505e-01,  1.0296676308e-02, -5.9643266723e-03,
         -2.2244569845e-04,  1.5874393284e-03,  9.7708535194e-01,
         -7.4371069670e-02,  8.7442662334e-05, -2.0362114906e-01,
          1.5027550697e+01],
        [ 1.2441553175e-02,  3.5354614258e+00, -4.9783945084e-01,
          1.0338279605e-01,  2.9940547943e+00, -1.0266765952e-01,
          1.2045311928e-01, -3.1238024235e+00,  3.3330893517e+00,
         -4.7617787123e-01]], requires_grad=True)
Gradient tensor(0.0078644780)

Index 3
Parameter containing:
tensor([-0.0413947999,  0.2711434066,  0.0748769045,  0.1031675935,
         0.0756872594,  0.3022760451,  0.2172745764, -0.2653046250,
         0.2037093341, -0.0445466638], requires_grad=True)
No gradient?

Index 4
Parameter containing:
tensor([[-4.1193764657e-02, -3.5373184830e-02,  1.8808110617e-03,
         -3.8154840004e-03, -2.7028546333e+00, -1.3087383270e+01,
         -3.1675234437e-02, -7.3683762550e-01, -4.4051003456e-01,
         -1.4208417851e-03]], requires_grad=True)
Gradient tensor(0.0099373152)

Index 5
Parameter containing:
tensor([-0.1635424048], requires_grad=True)
No gradient?

Now, if you download the Gist, the change can be made at L-342. Thank you very much for your help. I am embarrassed but honestly lost.

Thanks for the info!
I tried to debug your code and stumbled upon this line of code. You are detaching image_energy by calling .item(). Later in your get_loss method, you are using outputs to calculate the gradients.
However, since image_energy was detached, the computation graph shouldn’t compute any valid gradients before this point.
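
Just to illustrate the general behaviour of .item() (a toy example, not your code):

import torch

w = torch.ones(4, requires_grad=True)
energy = w.sum()            # a tensor with a grad_fn, still attached to the graph
detached = energy.item()    # a plain Python float: autograd cannot track it anymore

loss = (energy - 3.0) ** 2  # building the loss from the tensor keeps the graph intact
loss.backward()
print(w.grad)               # gradients flow back to w
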
I’m not completely understanding the code, so let me know if I’m on the wrong path.



I think I understand what you mean about the problem I create when detaching image_energy with .item(). I proceeded to change the line of code you referred to above, removing the .item(), and changed L-202 to outputs = torch.tensor(outputs, requires_grad=True). I still get the same problem. Does that mean that the tensors I am using to compute the loss function are broken?


Your analysis is right. Let me just give a brief idea about what is going on. These targets are two energies (scalars) of two molecules with 4 atoms each. The inputs, in this case, are 4 vectors for each molecule because there are 4 atoms (see here). Applying forward to those features returns atomic energies. That is why I sum them, append their sum (image_energy) to the outputs array, and then pass them to the loss. I am probably failing to see how to sum those 4 atomic energies to recover the total energy without detaching and breaking the graph computation.


Thanks for the explanation!
Your experiment sounds really interesting.

Rewrapping outputs in a new tensor also detaches it from the graph.
Your code should work with the following changes:

...
     outputs.append(image_energy)
outputs = torch.stack(outputs)
loss, rmse = self.get_loss(outputs, targets, 4)

Could you check for valid results?

I just checked my commits, and I was using torch.stack but dropped it at some point. With that change, I am now getting this (thanks for spotting that!):

outputs
tensor([[-14.5754413605],
        [-14.5754413605]], grad_fn=<StackBackward>)
targets
tensor([-14.5868730545, -14.5640010834])
Diff in intercept_Cu
Diff in slope_Cu
Diff in linears.Cu.0.weight
Diff in linears.Cu.0.bias
Diff in linears.Cu.2.weight
Diff in linears.Cu.2.bias
Diff in linears.Cu.4.weight
Diff in linears.Cu.4.bias

Optimized parameters for Cu symbol
Index 0
Parameter containing:
tensor([[-3.2861328236e-06,  2.6473844628e-05,  3.5650995045e-08,
          1.4864326658e-06, -1.0720622959e-04,  3.0406141605e-08,
          1.3593539488e-11, -1.9859398570e-09],
        [-5.8556226534e-10, -1.1646034137e-10, -3.3600910682e-12,
          6.7106892265e-10,  1.8430428624e-13,  1.9599950107e-08,
          2.4708364435e-05,  3.2043487863e-07],
        [ 1.9733222143e-04,  1.3596744564e-10,  1.6436105810e-08,
         -3.0531379647e-09, -6.3785437305e-06,  7.4555270811e-13,
          8.1752958067e-05, -1.1181630725e-05],
        [-1.2528111881e-09,  4.6580535127e-06,  2.3549859979e-12,
          2.8091712984e-11,  1.7993905931e-04, -1.4886735508e-12,
         -9.2567507479e-12,  2.8859590202e-06],
        [ 3.4168685943e-06, -5.6807679357e-05,  1.2368669559e-05,
         -2.1798576399e-12, -1.4500128600e-05, -3.1362407071e-07,
          1.7807322283e-09,  9.7959136838e-06],
        [ 8.9927290503e-08, -5.2266013739e-09, -9.1957379753e-14,
         -2.9821003977e-07, -1.7513568764e-06,  1.0372443600e-13,
          1.2319574694e-07, -3.6574114347e-05],
        [ 5.3554540500e-06, -1.8524660845e-05,  1.1853338037e-05,
         -2.1492420638e-04,  2.3621556466e-05, -8.0939061009e-11,
         -6.9240194023e-08,  1.5314364646e-05],
        [-2.0754782781e-08, -1.9774879547e-05, -1.3601642422e-05,
          5.2368657634e-05,  3.3497635741e-05, -5.8766081565e-06,
         -6.5623047703e-05,  3.9108752389e-05],
        [-1.5761332861e-06, -7.3087621786e-06, -2.9493878628e-07,
         -4.5463502829e-07, -3.2682427786e-07, -1.4819252101e-05,
          2.6041425372e-05, -1.0358776308e-06],
        [-6.6475655558e-07, -1.3438479496e-10,  1.8068027430e-07,
         -2.5042306007e-09, -5.2879945542e-06, -6.9557786446e-06,
          2.9763690179e-08, -2.8302894425e-14]], requires_grad=True)
Gradient tensor(-6.9187924964e-05)

Index 1
Parameter containing:
tensor([ 0.0316366255,  0.2617242932,  0.3015798032, -0.0021922502,
        -0.0615932010, -0.2602513433, -0.0311477333, -0.2361671627,
         0.1662444025, -0.0660640150], requires_grad=True)
Gradient tensor(2.4240243301e-16)

Index 2
Parameter containing:
tensor([[ 9.7365699503e-08,  4.2661922635e-06,  1.9631031591e-06,
         -4.2697833123e-05,  6.2208728195e-06,  1.1655485604e-12,
         -6.4603467763e-05, -1.3209832117e-09,  1.1756450391e-07,
         -7.9867913882e-07],
        [ 6.2096376041e-07, -6.1568898673e-06, -7.3711348136e-10,
          3.8170369088e-09, -1.4567660855e-06, -1.9914123186e-06,
          2.3581033020e-06, -1.4400919781e-06,  1.2830110308e-09,
         -9.6331113753e-08],
        [-1.5564822888e-06,  2.5510303203e-06,  1.5743958670e-07,
         -1.4166996607e-06, -1.3405845323e-07, -9.9951203083e-06,
          1.0537170965e-05,  7.7086369856e-06,  2.7015998813e-11,
          4.8960191457e-10],
        [-1.7156789909e-06, -2.3564821277e-06,  1.5615292262e-13,
          3.7418202217e-13, -4.3808788178e-05, -1.6505031454e-05,
         -8.4225684986e-06,  4.8483889259e-06, -5.0767212656e-08,
          2.3069132737e-07],
        [ 2.8512771678e-05, -7.5202997323e-06,  1.7333918549e-06,
          1.2562820473e-07,  5.2780844271e-05, -1.1339360562e-09,
          3.8166854659e-13, -1.2022780993e-07,  7.6206299127e-05,
          1.7692066194e-06],
        [-1.8082403130e-07,  1.8067876226e-06,  1.5731624337e-10,
         -3.9476603410e-12, -1.6683844706e-07,  1.6850806333e-06,
         -2.5483970489e-10, -1.8325088604e-05, -3.7972899918e-06,
          1.8083280651e-08],
        [ 1.2292147552e-12,  9.3183753052e-06, -1.0420450280e-07,
          3.0822411645e-07,  6.6431852019e-07,  9.7349062145e-10,
          3.3600823372e-05, -2.3434172908e-04, -4.7051515462e-11,
          3.7607719605e-08],
        [ 2.9142852020e-10,  1.6880323983e-07,  3.4797506032e-06,
          2.2227823138e-08, -8.4504938513e-07, -1.0985943663e-04,
          2.5039498723e-05,  8.6511966211e-13,  1.6281462740e-04,
         -2.8856720746e-07],
        [-2.9593747968e-06, -5.8458951457e-08, -5.6971380502e-08,
         -1.2519759184e-04, -7.4558295735e-13, -2.9341401842e-07,
          4.2668673927e-08,  4.3059226300e-06, -2.8244965478e-08,
         -5.0044291129e-06],
        [ 3.0511776004e-06, -1.0126113713e-15,  5.8472587625e-06,
         -7.7287486420e-06, -1.2084972241e-06, -4.0337028162e-10,
          7.3831834015e-05,  1.5755430240e-06,  1.3774927379e-12,
          1.3072969159e-04]], requires_grad=True)
Gradient tensor(-0.0001017772)

Index 3
Parameter containing:
tensor([ 0.1047996879, -0.2729528248,  0.2065724581,  0.0899153724,
        -0.1029036865, -0.3049356043, -0.2874263823,  0.1054596379,
        -0.2320392430, -0.0505143851], requires_grad=True)
Gradient tensor(-1.4201264947e-12)

Index 4
Parameter containing:
tensor([[ 1.3515332284e-06,  1.9205540269e-08,  1.9853887352e-05,
          2.2419371817e-06,  1.0494016323e-08, -4.0517086745e-05,
          1.4675807324e-04,  5.4397496285e-08, -6.9757788879e-08,
          3.3906806038e-06]], requires_grad=True)
Gradient tensor(0.0001229179)

Index 5
Parameter containing:
tensor([-0.2859740555], requires_grad=True)
Gradient tensor(-7.3175310256e-09)

The two variables that were not changing before are now changing. However, the outputs remain the same even though the loss value at epoch 1000 was 1.643620e-05. Do you have any idea why? Doesn’t that mean that the outputs are very near to the targets? Meanwhile, I am playing with the learning rate and weight decay to see what happens.

So while the loss decreases, the outputs stay approximately the same? Let me know if that’s the case and I’ll do some more debugging.

One issue that comes to my mind is that we’ve recently had similar problems using nn.MSELoss when the model output and target had a shape mismatch and were silently broadcast. Could you check the shapes of all tensors passed to nn.MSELoss?
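
As a quick illustration of that broadcasting issue (toy shapes, not your data):

import torch
import torch.nn as nn

criterion = nn.MSELoss()
output = torch.randn(2, 1)   # shape [2, 1]
target = torch.randn(2)      # shape [2]
# the pair broadcasts to [2, 2], so the loss is averaged over 4 element pairs instead of 2
loss = criterion(output, target)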

That is exactly the case. While the loss is decreasing, the outputs remain the same.

OK! I checked, and the shapes are not the same!

output = torch.Size([2, 2])
target = torch.Size([2])

Why is stack changing the shape of the output tensor? What can I use instead of stack? I really appreciate all your help!

Great! If you don’t want an additional dimension, you could use torch.cat, but you would end up with 4 values nevertheless, so there is a shape mismatch indeed.
If you are dealing with energies for 4 atoms, I would expect the target to also have 4 values. Does that make sense, or am I still missing something?
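
A quick illustration of the shape difference (toy tensors standing in for image_energy):

import torch

a = torch.tensor([1.0])  # one "image_energy" of shape [1]
b = torch.tensor([2.0])

print(torch.stack([a, b]).shape)  # torch.Size([2, 1]) -- stack adds a new dimension
print(torch.cat([a, b]).shape)    # torch.Size([2])    -- cat joins along the existing dim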

It makes sense, but I don’t have access to energies per atom :(. The energy of a molecular system comes from something we call the wave function, and solving the Schrödinger equation with that function returns only total energies, not atomic ones.

Now with cat the sizes match:

torch.Size([2])
torch.Size([2])

But the loss keeps decreasing without the outputs changing. Do you think I should change the forward method?

Thanks for the explanation. I remember some things from high school about the energy levels etc.

I’m not completely sure what’s going on as I would assume your output should now contain 4 elements.
I would like to debug it a bit later this evening as I currently don’t have access to my machine.

Thanks! I will keep changing things here and there. I am very interested in understanding what is causing this issue.

I tried to debug the code a bit more and in my code the shapes of the output and target were [2, 1] and [2], respectively.
To fix this, I just created the target as:

targets = [[-14.586873530850994], [-14.56400104603344]]

Also, it seems you are passing the input in a shape of [8]. A linear layer would expect an input of [batch_size, nb_features], so I unsqueezed the tensor in dim0:

X = X.unsqueeze(0)
X = self.linears[symbol](X)

The output values are really close to each other and sometimes even have the same values.
If I just change the targets randomly to have a larger distance, the output values also diverge a bit, but generally your input and target seem to contain the “signal” in a low precision range.
Have you thought about some normalization scheme?
I’m a bit afraid that we are currently running into floating point precision issues (~1e-6 would be the limit).
However, even with float64 precision, I couldn’t really fit the data.

loss 9.11804408893418e-6, rmse 0.011436155542466114