  File "/lustre/home/tgdong/ML-QMMM/pytorch_heat_ref_new_onlyR_nucdependNN_moreelement/AM1_model_force_AM1dm_NN_ri_encode_smoothri.py", line 769, in forward
    e, c = torch.linalg.eigh(F_new+1.0e-9)
 (Triggered internally at …/torch/csrc/autograd/python_anomaly_mode.cpp:104.)
  inputs, allow_unused, accumulate_grad=False)
Traceback (most recent call last):
  File "pytorch_lossin_force_allmolineachbatch_bigmol_readmask_nucparm_am_smoothri_diffweightri_w100_sigma.py", line 560, in <module>
    train_loss1= train_loop_RMSE(training_data,mol_ind_all,model, loss_fn, optimizer,lambda_AM1inti,penaltyAM1,batch_size,w_E,w_chg,w_dip,w_F)
  File "pytorch_lossin_force_allmolineachbatch_bigmol_readmask_nucparm_am_smoothri_diffweightri_w100_sigma.py", line 185, in train_loop_RMSE
    pred_E,pred_E_HEAT,out_correction,pred_F = model(x)
  File "/lustre/software/anaconda/anaconda3-2019.10-py37/envs/pytorch-gpu-1.3.1-py37/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/lustre/home/tgdong/ML-QMMM/pytorch_heat_ref_new_onlyR_nucdependNN_moreelement/AM1_model_force_AM1dm_NN_ri_encode_smoothri.py", line 843, in forward
    grad_e = -torch.autograd.grad((e_heat_formation).sum(), input_atomcoord, create_graph=True)[0]
  File "/lustre/software/anaconda/anaconda3-2019.10-py37/envs/pytorch-gpu-1.3.1-py37/lib/python3.7/site-packages/torch/autograd/__init__.py", line 236, in grad
    inputs, allow_unused, accumulate_grad=False)
RuntimeError: Function 'LinalgEighBackward0' returned nan values in its 0th output.
It would appear that you are calling torch.linalg.eigh() in your forward
pass. Note that when an otherwise well-behaved eigenvector/eigenvalue
problem has eigenvalues that are (nearly) degenerate, gradients with
respect to the eigenvectors become ill-defined.
(In this case, "degenerate eigenvalues" means eigenvalues that are
equal to one another.)
You might try printing out the eigenvalues during the forward pass and
seeing whether any of them are close to being degenerate.
Gradients computed using the eigenvectors tensor will only be finite when A has
distinct eigenvalues. Furthermore, if the distance between any two eigenvalues is
close to zero, the gradient will be numerically unstable, as it depends on the
eigenvalues λ_i through the computation of 1 / min_{i≠j} (λ_i − λ_j).
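
Following the suggestion above, here is a minimal diagnostic sketch for spotting a near-zero eigenvalue gap inside the forward pass. The helper name and the 1e-6 tolerance are illustrative, not from the original code; F_new is assumed to be the symmetric matrix passed to eigh().

import torch

def report_min_eigenvalue_gap(F_new, tol=1.0e-6):
    # Diagnostic only: look at the spacing between adjacent eigenvalues.
    # A gap near zero is exactly the situation in which eigh()'s backward
    # pass divides by min_{i != j} (lambda_i - lambda_j) and blows up.
    with torch.no_grad():
        e = torch.linalg.eigvalsh(F_new)        # eigenvalues, ascending order
        gaps = e[..., 1:] - e[..., :-1]         # adjacent differences
        min_gap = gaps.min()
        if min_gap < tol:
            print(f"near-degenerate eigenvalues: smallest gap = {min_gap.item():.3e}")
    return min_gap
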
(The core issue is that when eigenvalues are degenerate, individual
eigenvectors are no longer uniquely defined. Instead, eigen-subspaces
are uniquely defined, but the choice of which eigenvectors within a given
eigen-subspace to use as its basis becomes arbitrary.)
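
A tiny illustration of that arbitrariness (not from the original post): the 2x2 identity matrix has a doubly degenerate eigenvalue, so any orthonormal pair of vectors is a valid eigenvector basis, and eigh() simply returns one of them.

import torch

e, c = torch.linalg.eigh(torch.eye(2))
print(e)   # tensor([1., 1.])  -- a doubly degenerate eigenvalue
print(c)   # one arbitrary orthonormal basis; any rotation of it is equally valid
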
If you have degenerate eigenvalues and you take the gradient of something
(for example, a loss function) that depends on the eigenvectors, you will
inevitably run into this problem. It is not merely a "numerical issue", nor is it
a problem with PyTorch's eigh() implementation.
You can take the gradient of something that depends on the eigenvalues
(and not the eigenvectors themselves), but if this is your use case, you
might want to make this explicit by using eigvalsh() to compute just the
eigenvalues.
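
For that eigenvalue-only case, a minimal sketch of the difference (the 4x4 identity is just an illustrative input, and the loss here depends only on the eigenvalue sum): eigvalsh()'s backward pass does not involve the 1 / (λ_i − λ_j) factor, so it stays finite even when eigenvalues coincide.

import torch

# A matrix with fully degenerate eigenvalues (all equal to 1).
A = torch.eye(4, dtype=torch.float64, requires_grad=True)

# Only eigenvalues are needed, so use eigvalsh() instead of eigh().
evals = torch.linalg.eigvalsh(A)
loss = evals.sum()                     # depends on the eigenvalues only
loss.backward()
print(A.grad)                          # finite: the 4x4 identity matrix
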