  File "/lustre/home/tgdong/ML-QMMM/pytorch_heat_ref_new_onlyR_nucdependNN_moreelement/AM1_model_force_AM1dm_NN_ri_encode_smoothri.py", line 769, in forward
    e, c = torch.linalg.eigh(F_new+1.0e-9)
(Triggered internally at …/torch/csrc/autograd/python_anomaly_mode.cpp:104.)
    inputs, allow_unused, accumulate_grad=False)
Traceback (most recent call last):
  File "pytorch_lossin_force_allmolineachbatch_bigmol_readmask_nucparm_am_smoothri_diffweightri_w100_sigma.py", line 560, in <module>
    train_loss1= train_loop_RMSE(training_data,mol_ind_all,model, loss_fn, optimizer,lambda_AM1inti,penaltyAM1,batch_size,w_E,w_chg,w_dip,w_F)
  File "pytorch_lossin_force_allmolineachbatch_bigmol_readmask_nucparm_am_smoothri_diffweightri_w100_sigma.py", line 185, in train_loop_RMSE
    pred_E,pred_E_HEAT,out_correction,pred_F = model(x)
  File "/lustre/software/anaconda/anaconda3-2019.10-py37/envs/pytorch-gpu-1.3.1-py37/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/lustre/home/tgdong/ML-QMMM/pytorch_heat_ref_new_onlyR_nucdependNN_moreelement/AM1_model_force_AM1dm_NN_ri_encode_smoothri.py", line 843, in forward
    grad_e = -torch.autograd.grad((e_heat_formation).sum(), input_atomcoord, create_graph=True)[0]
  File "/lustre/software/anaconda/anaconda3-2019.10-py37/envs/pytorch-gpu-1.3.1-py37/lib/python3.7/site-packages/torch/autograd/__init__.py", line 236, in grad
    inputs, allow_unused, accumulate_grad=False)
RuntimeError: Function 'LinalgEighBackward0' returned nan values in its 0th output.

It would appear that you are calling torch.linalg.eigh() in your forward
pass. Note that when an otherwise well-behaved eigenvector / eigenvalue
problem has eigenvalues that are (nearly) degenerate, gradients with
respect to eigenvectors become ill-defined.

(In this case, the term degenerate eigenvalues means eigenvalues that are
equal to one another.)

You might try printing out the eigenvalues during the forward pass and see
whether any of them are close to being degenerate.
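One way to do that check is sketched below. The helper name min_eigenvalue_gap, the example matrix, and the 1.0e-6 threshold are all assumptions made for illustration; in your model you would pass in whatever you feed to eigh() (F_new in your traceback).

```python
import torch

def min_eigenvalue_gap(matrix: torch.Tensor) -> torch.Tensor:
    """Smallest gap between neighbouring eigenvalues of a symmetric matrix."""
    # eigvalsh() returns the eigenvalues in ascending order, so comparing
    # adjacent entries is enough to find the closest pair.
    evals = torch.linalg.eigvalsh(matrix.detach())
    return (evals[..., 1:] - evals[..., :-1]).min(dim=-1).values

# Hypothetical symmetric matrix with a repeated eigenvalue (2, 2, 5).
F_example = torch.diag(torch.tensor([2.0, 2.0, 5.0], dtype=torch.float64))

gap = min_eigenvalue_gap(F_example)
print("eigenvalues:", torch.linalg.eigvalsh(F_example))
print("smallest gap:", gap.item())

# The threshold is an arbitrary choice for this sketch, not a recommended value.
if gap < 1.0e-6:
    print("warning: (nearly) degenerate eigenvalues")
```

Calling this just before eigh() in forward() should show whether the nan appears on the same step where the gap collapses.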

The torch.linalg.eigh() documentation puts it this way: Gradients computed using the eigenvectors tensor will only be finite when A has distinct eigenvalues. Furthermore, if the distance between any two eigenvalues is close to zero, the gradient will be numerically unstable, as it depends on the eigenvalues $\lambda_i$ through the computation of $\frac{1}{\min_{i \neq j} \lambda_i - \lambda_j}$.

(The core issue is that when eigenvalues are degenerate, individual
eigenvectors are no longer uniquely defined. Instead, eigen-subspaces
are uniquely defined, but the choice of which eigenvectors within a given
eigen-subspace to use as its basis becomes arbitrary.)

If you have degenerate eigenvalues and you take the gradient of something
(for example, a loss function) that depends on the eigenvectors, you will
inevitably meet this problem. It is not merely a “numerical issue” nor a
problem with PyTorch’s eigh() implementation.
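Here is a small sketch of that failure. The toy loss (a weighted sum of eigenvector entries) and the two diagonal test matrices are assumptions chosen only to make the effect visible; they are not taken from your model.

```python
import torch

def eigenvector_loss_grad(matrix: torch.Tensor) -> torch.Tensor:
    """Gradient of a toy eigenvector-dependent loss with respect to `matrix`."""
    a = matrix.clone().requires_grad_(True)
    evals, evecs = torch.linalg.eigh(a)
    # Toy loss (illustrative assumption): weighted sum of eigenvector entries.
    weights = torch.arange(1.0, 10.0, dtype=a.dtype).reshape(3, 3)
    loss = (evecs * weights).sum()
    (grad,) = torch.autograd.grad(loss, a)
    return grad

# Same structure, but one matrix has a repeated eigenvalue.
distinct = torch.diag(torch.tensor([1.0, 2.0, 5.0], dtype=torch.float64))
degenerate = torch.diag(torch.tensor([2.0, 2.0, 5.0], dtype=torch.float64))

grad_distinct = eigenvector_loss_grad(distinct)
grad_degenerate = eigenvector_loss_grad(degenerate)

print("distinct eigenvalues, gradient finite:", torch.isfinite(grad_distinct).all().item())
print("repeated eigenvalue, gradient finite: ", torch.isfinite(grad_degenerate).all().item())
```

Without anomaly detection the non-finite gradient is returned silently; with torch.autograd.set_detect_anomaly(True) the backward pass raises an error instead, as in your traceback.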

You can take the gradient of something that depends on the eigenvalues
(and not the eigenvectors themselves), but if this is your use case, you
might want to make this explicit by using eigvalsh() to compute just the
eigenvalues.
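As a rough illustration, the sketch below backpropagates through eigvalsh() for a matrix with a repeated eigenvalue. The matrix and the loss (sum of squared eigenvalues, which equals trace(A²) and therefore has gradient 2A for symmetric A) are assumptions picked so the result can be checked by hand.

```python
import torch

# Hypothetical symmetric matrix with a repeated eigenvalue (2, 2, 5).
a = torch.diag(torch.tensor([2.0, 2.0, 5.0], dtype=torch.float64)).requires_grad_(True)

# Only the eigenvalues are computed; no eigenvectors enter the loss.
evals = torch.linalg.eigvalsh(a)
loss = (evals ** 2).sum()

(grad,) = torch.autograd.grad(loss, a)
print("gradient finite:", torch.isfinite(grad).all().item())
print(grad)
```

The gradient stays well behaved here because the loss depends only on the eigenvalues, and those remain well defined even when the individual eigenvectors are not.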

Hi Frank! I’m currently experiencing a similar problem: I’m using a truncated SVD on the input variable Z. The loss function is the MSE between the reconstructed Z and the ground truth, which leads to a RuntimeError: Function ‘LinalgSvdBackward0’ returned nan values in its 0th output. Could you explain the reason for this in more detail?
I would be very grateful if you could help me with the explanation.

This is essentially the same issue as in the linalg.eigh() case.

Singular-value decomposition is a generalization of eigendecomposition and
when singular values (the analogs of eigenvalues) become degenerate, gradients
with respect to U and V (or Vh) will diverge. The only real solution is to modify
your approach so that you do not take gradients with respect to U or V in cases
where your singular values might become degenerate.

Note that torch.svd() has been deprecated in favor of torch.linalg.svd(),
which has a slightly different syntax.
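To make both points concrete, here is a minimal sketch. The input matrix (2 times the identity, so all singular values are equal) and the nuclear-norm loss are assumptions for illustration, and svdvals() is simply one way to build a loss that depends on the singular values alone.

```python
import torch

# Hypothetical input whose singular values are all equal (2, 2, 2).
a = (2.0 * torch.eye(3, dtype=torch.float64)).requires_grad_(True)

# torch.linalg.svd() returns Vh (V conjugate-transposed), whereas the
# deprecated torch.svd() returned V itself.
U, S, Vh = torch.linalg.svd(a, full_matrices=False)
reconstruction = U @ torch.diag(S) @ Vh
print("reconstruction matches input:", torch.allclose(reconstruction, a))

# A loss built from the singular values alone (here the nuclear norm)
# does not require gradients with respect to U or Vh.
singular_values = torch.linalg.svdvals(a)
nuclear_norm = singular_values.sum()
(grad,) = torch.autograd.grad(nuclear_norm, a)
print("gradient finite:", torch.isfinite(grad).all().item())
```

Whether a loss like this fits your problem depends on what you need from the decomposition; the sketch only shows that the singular values themselves can be differentiated safely.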

As an aside, it probably would have been better to start a new thread about
this rather than adding a post to the linalg.eigh() thread. (You can always
link to an old thread if you think it provides helpful context.)

Thank you for your guidance! I appreciate your patience and I’m committed to learning and abiding by the rules in the future.
I have started a new thread about this and I still have some questions. Could you please take a look at my question there and give me an answer?