Hi @albanD. Thanks for providing useful insights on debugging this tricky issue! I have encountered the same problem but have been unable to overcome it.
I obtained the following traceback:
[W python_anomaly_mode.cpp:104] Warning: Error detected in MulBackward0. Traceback of forward call that caused the error:
  File "train_eval.py", line 291, in <module>
    start_epoch=0,
  File "train_eval.py", line 126, in train_eval_model
    s_pred_list = model(data_list, points_gt_list, edges_list, n_points_gt_list, perm_mat_list)
  File "/home/user/envs/pytorch-gpu-1.7.1/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/user/code/matching/BB_GM/model.py", line 141, in forward
    for gm_solver, unary_costs, quadratic_costs in zip(gm_solvers, unary_costs_list, quadratic_costs_list)
  File "/home/user/code/matching/BB_GM/model.py", line 141, in <listcomp>
    for gm_solver, unary_costs, quadratic_costs in zip(gm_solvers, unary_costs_list, quadratic_costs_list)
  File "/home/user/envs/pytorch-gpu-1.7.1/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/user/code/matching/BB_GM/ADGM.py", line 448, in forward
    tmp = ADGMWrapper(costs[0], costs[1], edges_left.T, edges_right.T, rounding=(not self.training), **self.solver_params)
  File "/home/user/code/matching/BB_GM/ADGM.py", line 232, in ADGMWrapper
    return ADGM(costs, P, rounding=rounding, **kargs)
  File "/home/user/code/matching/BB_GM/ADGM.py", line 145, in ADGM
    Z = X*temp2
 (function _print_stack)
Traceback (most recent call last):
  File "train_eval.py", line 291, in <module>
    start_epoch=0,
  File "train_eval.py", line 132, in train_eval_model
    loss.backward()
  File "/home/user/envs/pytorch-gpu-1.7.1/lib/python3.7/site-packages/torch/tensor.py", line 221, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/user/envs/pytorch-gpu-1.7.1/lib/python3.7/site-packages/torch/autograd/__init__.py", line 132, in backward
    allow_unreachable=True) # allow_unreachable flag
RuntimeError: Function 'MulBackward0' returned nan values in its 1th output.
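For reference, the warning at the top was produced by running with autograd anomaly detection enabled, which is what records the forward-call stack of the offending op:

import torch
# Checks each backward function's outputs for NaNs and, when one appears,
# prints the forward traceback of the op that created it.
torch.autograd.set_detect_anomaly(True)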
Apparently, the line that caused the issue is Z = X*temp2 in the following code (simplified for convenience):
# Initialization
n1, n2 = U.shape
X = torch.zeros_like(U) + 1.0/n2
Z = torch.zeros_like(U) + 1.0/n1
Y = torch.zeros_like(U)
for i in range(iterations):
    # Update X
    ...
    temp = torch.exp(X - torch.max(X, dim=-1, keepdim=True)[0])
    X = Z*temp
    # Normalize: sum of each row of X is 1
    X = X / torch.max(X, dim=-1, keepdim=True)[0]
    X = X / torch.sum(X, dim=-1, keepdim=True)
    print(f'X normalized:\n {X}')
    print(f'X normalized sum over row:\n {torch.sum(X, dim=-1)}')
    # Update Z
    ...
    temp2 = torch.exp(Z - torch.max(Z, dim=-2, keepdim=True)[0])
    Z = X*temp2
    # Normalize: sum of each column of Z is 1
    Z = Z / torch.max(Z, dim=-2, keepdim=True)[0]
    Z = Z / torch.sum(Z, dim=-2, keepdim=True)
    print(f'Z normalized:\n {Z}')
    print(f'Z normalized sum over column:\n {torch.sum(Z, dim=-2)}')
    # Update Y
    ...
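A side note on the torch.exp calls: subtracting the row/column max is the usual overflow guard, but the shifted exponents can still be very negative, and in float32 torch.exp underflows to exactly 0.0 below roughly -104. I assume this is where the hard zeros in X and Z visible in the logs below come from. A minimal standalone illustration (not from the actual code):

import torch

# float32's smallest subnormal is ~1.4e-45, so exp(x) rounds to exactly
# zero once x drops below about ln(1.4e-45 / 2) ~= -104.
print(torch.exp(torch.tensor(-100.0)))  # tensor(3.7835e-44) -- subnormal, nonzero
print(torch.exp(torch.tensor(-110.0)))  # tensor(0.)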
Here are the logs printed right before the traceback:
X normalized:
tensor([[0.0000e+00, 4.2653e-02, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00,
0.0000e+00, 0.0000e+00, 9.5735e-01, 0.0000e+00, 0.0000e+00],
[0.0000e+00, 0.0000e+00, 1.7397e-40, 0.0000e+00, 0.0000e+00, 9.9391e-01,
0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 6.0904e-03],
[0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00,
0.0000e+00, 1.0000e+00, 0.0000e+00, 2.2040e-19, 0.0000e+00],
[0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00,
0.0000e+00, 1.0000e+00, 0.0000e+00, 1.7096e-30, 0.0000e+00],
[0.0000e+00, 0.0000e+00, 4.9784e-08, 1.4729e-09, 5.5091e-01, 1.1846e-12,
0.0000e+00, 0.0000e+00, 1.8706e-17, 1.5666e-26, 4.4909e-01],
[0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00,
0.0000e+00, 1.0000e+00, 0.0000e+00, 1.6033e-32, 1.0499e-27],
[0.0000e+00, 0.0000e+00, 0.0000e+00, 6.8453e-34, 2.1804e-33, 3.3585e-03,
0.0000e+00, 0.0000e+00, 0.0000e+00, 5.4462e-15, 9.9664e-01],
[0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00,
0.0000e+00, 4.0123e-01, 0.0000e+00, 5.9877e-01, 0.0000e+00],
[1.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00,
6.6972e-33, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00],
[0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 3.3273e-06,
0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 1.0000e+00],
[0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 5.2851e-34,
0.0000e+00, 1.0000e+00, 0.0000e+00, 5.3407e-21, 6.9220e-31]],
device='cuda:0', grad_fn=<DivBackward0>)
X normalized sum over row:
tensor([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.], device='cuda:0',
grad_fn=<SumBackward1>)
Z normalized:
tensor([[0.0000e+00, 1.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00,
0.0000e+00, 0.0000e+00, 1.0000e+00, 0.0000e+00, 0.0000e+00],
[0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 9.9989e-01,
0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 1.8221e-04],
[0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00,
0.0000e+00, 4.6598e-04, 0.0000e+00, 3.2422e-22, 0.0000e+00],
[0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00,
0.0000e+00, 9.9953e-01, 0.0000e+00, 1.1537e-26, 0.0000e+00],
[0.0000e+00, 0.0000e+00, 1.0000e+00, 1.0000e+00, 1.0000e+00, 1.0164e-10,
0.0000e+00, 0.0000e+00, 5.2710e-26, 2.1740e-20, 6.0694e-01],
[0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00,
0.0000e+00, 6.0779e-07, 0.0000e+00, 7.5840e-38, 3.8613e-41],
[0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 4.3132e-12,
0.0000e+00, 0.0000e+00, 0.0000e+00, 2.9244e-19, 7.7734e-11],
[0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00,
0.0000e+00, 4.2213e-06, 0.0000e+00, 1.0000e+00, 0.0000e+00],
[1.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00,
1.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00],
[0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 1.0853e-04,
0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 3.9288e-01],
[0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00,
0.0000e+00, 5.3232e-07, 0.0000e+00, 9.5763e-26, 2.8026e-45]],
device='cuda:0', grad_fn=<DivBackward0>)
Z normalized sum over column:
tensor([1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000,
1.0000, 1.0000], device='cuda:0', grad_fn=<SumBackward1>)
For further investigation, I also printed the values of temp2:
temp2:
tensor([[2.2421e-44, 1.0000e+00, 3.4224e-15, 3.8321e-19, 5.6191e-10, 5.0389e-06,
7.1892e-10, 1.3331e-01, 1.0000e+00, 1.0497e-04, 1.3752e-05],
[3.8088e-39, 7.9228e-24, 4.8935e-07, 9.9650e-15, 1.6200e-08, 1.1725e-02,
4.3550e-10, 2.2858e-04, 3.8238e-20, 1.4496e-07, 2.2137e-02],
[9.2906e-43, 1.9936e-23, 2.1098e-23, 2.9616e-23, 1.8445e-23, 2.6213e-15,
1.8063e-07, 4.6620e-04, 1.5636e-24, 1.0600e-09, 4.0346e-16],
[1.0970e-39, 1.4961e-20, 9.6151e-17, 7.5080e-16, 1.1495e-16, 2.6410e-10,
2.4117e-06, 1.0000e+00, 1.6358e-17, 4.8629e-03, 2.9368e-10],
[7.4011e-41, 5.9593e-24, 1.0000e+00, 1.0000e+00, 1.0000e+00, 1.0000e+00,
3.8696e-10, 1.3359e-06, 2.6977e-09, 1.0000e+00, 1.0000e+00],
[1.9164e-41, 1.7378e-23, 6.1811e-22, 1.3114e-25, 6.0049e-23, 1.2770e-14,
2.4556e-12, 6.0808e-07, 3.7422e-24, 3.4006e-12, 2.7213e-14],
[4.8908e-35, 4.5041e-33, 9.3833e-18, 4.9370e-20, 5.4802e-17, 1.4968e-11,
7.7122e-11, 5.6934e-12, 4.8426e-30, 3.8693e-11, 5.7712e-11],
[7.4685e-38, 6.6625e-31, 3.3640e-21, 3.4035e-15, 3.7193e-24, 2.0841e-16,
1.8670e-04, 1.0526e-05, 2.1427e-28, 1.2035e-06, 2.4145e-17],
[1.0000e+00, 0.0000e+00, 1.8710e-41, 1.4928e-38, 0.0000e+00, 6.2585e-38,
1.0000e+00, 7.3923e-30, 0.0000e+00, 5.2951e-21, 6.4194e-40],
[1.3871e-40, 2.5689e-23, 7.0096e-08, 3.0429e-13, 1.6090e-07, 3.8015e-01,
1.7917e-10, 6.8646e-05, 7.7878e-19, 4.3145e-07, 2.9071e-01],
[1.2150e-36, 1.0209e-26, 4.6150e-22, 4.5896e-25, 1.7469e-23, 1.8506e-14,
2.4162e-11, 5.3257e-07, 1.5404e-26, 1.2921e-11, 3.5527e-15]],
device='cuda:0', grad_fn=<ExpBackward>)
It seems to me that both X and temp2 look numerically fine, yet the operation Z = X*temp2 returned NaN values in its 1th output (i.e., the gradient w.r.t. temp2, whose local derivative is X). Would you have any ideas on how to fix this, please?
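For what it's worth, my current guess is that the multiplication itself is fine and the NaN comes from the incoming gradient: wherever X is exactly 0 and the gradient flowing back into Z is inf (e.g. from a later division whose denominator hit zero), MulBackward0 computes inf * 0 = NaN on the temp2 side. A minimal standalone sketch that reproduces the exact same error message (the 1.0 / z here is just a stand-in for a downstream op with an exploding gradient, not the actual normalization code):

import torch

torch.autograd.set_detect_anomaly(True)

x = torch.tensor([0.0, 1.0], requires_grad=True)
y = torch.tensor([2.0, 3.0], requires_grad=True)
z = x * y              # z[0] is exactly 0, like the hard zeros in X
out = (1.0 / z).sum()  # forward already contains an inf at index 0
# Backward: grad w.r.t. z is -1/z**2 = -inf at index 0, so MulBackward0
# returns grad w.r.t. y as grad_z * x = -inf * 0 = nan there.
out.backward()
# RuntimeError: Function 'MulBackward0' returned nan values in its 1th output.

If that is indeed the mechanism here, clamping the denominators of the normalization steps (e.g. torch.sum(...).clamp_min(eps)) might avoid the infs, but I haven't verified that this fixes it.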
Thank you very much in advance for your help!