LogBackward returned nan values in its 0th output

Hi,

While training my model, I got NaNs as the result of the loss function after many iterations. I used anomaly detection to see what was causing the issue, and I got this error:

Function 'LogBackward' returned nan values in its 0th output
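
For reference, I enabled anomaly detection roughly like this (a minimal sketch, not my exact training script):

    import torch

    # Make autograd record forward stack traces and report which forward
    # operation produced the NaN gradient during backward.
    torch.autograd.set_detect_anomaly(True)

    # ... build model, optimizer and data loader, then train as usual;
    # loss.backward() now raises the error above together with a warning
    # that shows the forward traceback.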

This is the only log function I use in my forward pass:

 return o.log()

And this is a sample of the values it receives:

         [0.0004, 0.0006, 0.0010,  ..., 0.0143, 0.0143, 0.0143],
         ...,
         [0.0007, 0.0005, 0.0007,  ..., 0.0146, 0.0146, 0.0146],
         [0.0007, 0.0005, 0.0007,  ..., 0.0147, 0.0147, 0.0147],
         [0.0007, 0.0005, 0.0007,  ..., 0.0147, 0.0147, 0.0147]],

I cannot catch any zeros or negative numbers passed to it. Can you maybe tell me a line I could add to catch the value that is causing the issue, or any way to recover from this?

Hi,

Maybe some values in there are small enough that they produce -inf?
If you add o = o.clamp(min=1e-4), does that prevent the issue?

I got the same error:

    o = o.clamp(min=1e-4)
    return o.log(),planlogits
Traceback (most recent call last):
  File "./train.py", line 351, in <module>
    main(args)
  File "./train.py", line 137, in main
    train_joint
  File "./train.py", line 221, in train
    gr_loss.backward()
  File "~/anaconda3/envs/torch-sci/lib/python3.6/site-packages/torch/tensor.py", line 198, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "~/anaconda3/envs/torch-sci/lib/python3.6/site-packages/torch/autograd/__init__.py", line 100, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: Function 'LogBackward' returned nan values in its 0th output.

And don’t you have a warning showing a second stack trace just above by any chance? That should point to the faulty log function.

I see that the warning points to the exact function I mentioned:

Warning: Error detected in LogBackward. Traceback of forward call that caused the error:
  File "train.py", line 339, in <module>
    main(args)
  File "train.py", line 133, in main
    train_joint
  File "train.py", line 202, in train
    p, planlogits = graph_model(batch)
  File "~/anaconda3/envs/torch-sci/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "~/torch-sci/GraphWriter/models/newmodel.py", line 167, in forward
    return o.log(),planlogits

However, I tried removing the faulty log function, but then I got this instead. Do you think it's a cascaded error that somehow comes from a different module?

RuntimeError: Function 'MulBackward0' returned nan values in its 0th output.

As a note, I printed the min of o before clamping, and I can see that it already contains NaN even before being passed to the log function.
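
The check I added looks roughly like this (a minimal sketch, not the exact model code):

    # Debugging print placed right before the clamp/log in forward():
    if torch.isnan(o).any():
        print("o already contains NaN before the log")
    print("min of o before log:", o.min().item())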

Thanks for the trace. Just wanted to confirm that we’re not looking at the wrong place in the code!

Oh, so the forward value is NaN already?

Thanks for your quick response. Yes, and I don't know how this is possible. I assumed that anomaly detection finds the first occurrence of NaN and reports it, doesn't it?

Is there a better way to find the first NaN in the network? Here is the warning from the second error:

Warning: Error detected in MulBackward0. Traceback of forward call that caused the error:
  File "train.py", line 339, in <module>
    main(args)
  File "train.py", line 133, in main
    train_joint
  File "train.py", line 202, in train
    p, planlogits = graph_model(batch)
  File "~/anaconda3/envs/torch-sci/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "~/torch-sci/GraphWriter/models/newmodel.py", line 158, in forward
    z = (1-s)*z

I can keep tracking it back that way, but it's quite time-consuming because I have to run my model for a couple of iterations until I get the next error.

It does, but it can only look at the backward pass. So if NaNs appear in the forward, they won't be reported there.
You will need to add prints directly in your code to check where the NaN first appears.
In particular, you can check your_tensor.isnan().any() in the forward.
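
One way to automate those checks is to register forward hooks that flag the first module whose output contains a NaN (a minimal sketch; model and the error message are illustrative):

    import torch

    def nan_hook(module, inputs, output):
        # Modules may return a single tensor or a tuple/list of tensors.
        outputs = output if isinstance(output, (tuple, list)) else (output,)
        for i, out in enumerate(outputs):
            if isinstance(out, torch.Tensor) and torch.isnan(out).any():
                raise RuntimeError(
                    f"NaN found in output {i} of {module.__class__.__name__}"
                )

    # model is assumed to be your nn.Module (e.g. graph_model above).
    for submodule in model.modules():
        submodule.register_forward_hook(nan_hook)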

I will try to debug my forward pass then, thanks for pointing this out!

Hi @albanD, I found the NaN source in the forward pass. It's a masked softmax that uses -inf to mask the False values, but I guess I have many -infs, which is why it can return NaN.

    unnorm.masked_fill_(emask,-float('inf'))
    attn = F.softmax(unnorm,dim=2)
    out = torch.bmm(attn,emb)

I tried the line below as an alternative, but the values that should be masked end up close to the ones that shouldn't be. Any ideas here?

    unnorm.masked_fill_(emask,-1 * 1e-15)

1e-15 is practically the same as inf for float32.
You can try 1e-6 to avoid nan (this is the best precision you can have for a float).

Hi,

I have a similar problem, and since I am late to this thread, I am only writing this for future reference. The proper value is not -1e-15 or -1e-6; I think you meant to use -1e6, since that is approximately -inf, whereas -1e-15 is approximately zero.
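
For future readers, here is a sketch of the masked attention with a large finite negative fill value instead of -inf (it reuses the unnorm/emask/emb names from the snippet above; -1e9 is just one common choice, -1e6 works as well):

    import torch
    import torch.nn.functional as F

    # Mask with a large finite negative value instead of -inf, so rows that
    # are entirely masked still produce a finite (uniform) softmax output
    # instead of NaN.
    unnorm = unnorm.masked_fill(emask, -1e9)
    attn = F.softmax(unnorm, dim=2)
    out = torch.bmm(attn, emb)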

Hi @Ahmed_Abdelaziz, I am having the same issue with a masked softmax that uses -inf to mask the False values. Did you find a solution to this error?