https://pytorch.org/docs/stable/autograd.html#torch.autograd.detect_anomaly

I am suspecting NaN values in my script so I would like to use the anomaly detector of pytorch. However, I am confused as to how exactly to use it.

My dataset class uses h5py, so I am wondering where I have to include the context manager in order for it to work.

Hi,

You need to wrap your forward and backward passes with it, so wherever you do those.
If you want it for the whole script, you can also add torch.autograd.set_detect_anomaly(True) at the beginning of your script and it will stay on!
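A minimal sketch of both options (the tensor and the computation are just placeholders to illustrate where the context manager goes):

```python
import torch

# Option 1: enable anomaly detection globally, once, at the top of the script
torch.autograd.set_detect_anomaly(True)

x = torch.randn(4, requires_grad=True)

# Option 2: wrap only the forward and backward pass
with torch.autograd.detect_anomaly():
    y = (x * 2).sum()  # forward pass
    y.backward()       # backward pass
```

With anomaly detection on, a NaN produced during backward raises a RuntimeError that points at the forward operation that created it, instead of silently propagating NaN gradients.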

I don't have an explicit backward pass, only the forward method of my model. Or do you mean the backward invocation in my training loop?

Will the inclusion of this line at the beginning of my file then also apply to the crucial part in which I load in hdf5 files and convert them into tensors? Because this is where I suspect NaN values to occur…

I meant the backward invocation in your training loop.

At the moment, the anomaly detection mode only detects NaNs that appear during the .backward() call, as it is hard for the end user to poke around in what happens there.
It won't detect NaNs that appear outside of the backward invocation.

If NaNs appear in your own code, you can add checks at a few places to find where they appear. To check whether a tensor contains NaNs, you can use: if your_tensor.ne(your_tensor).any():
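A small sketch of that check as a helper function (has_nan is just an illustrative name, not a PyTorch API):

```python
import torch

def has_nan(t):
    # NaN is the only floating-point value not equal to itself,
    # so t.ne(t) is True exactly at the NaN positions
    return bool(t.ne(t).any())

clean = torch.tensor([1.0, 2.0, 3.0])
dirty = torch.tensor([1.0, float('nan'), 3.0])

has_nan(clean)  # False
has_nan(dirty)  # True
```

Equivalently, torch.isnan(t).any() does the same thing and is a bit more explicit about the intent.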


since NaN != NaN, I get it =)

thanks a lot!

How expensive is detect_anomaly to run during training? Insignificant, right? :D

It might be expensive, so use it for debugging only.


Actually very significant :smiley: It does a lot of bookkeeping and checks the return value of every low-level function for NaNs.
As mentioned above, you should only use it for debugging.

Great, thanks. I'll stick to checking the loss then:

assert not torch.isnan(loss).any(), 'nan err msg'
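For context, a hedged sketch of where such an assertion might sit in a training step (the model, optimizer, and data here are placeholders, not the original poster's setup):

```python
import torch

model = torch.nn.Linear(3, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = torch.nn.MSELoss()

x = torch.randn(8, 3)
target = torch.randn(8, 1)

loss = criterion(model(x), target)
# .any() also covers non-scalar losses (e.g. reduction='none')
assert not torch.isnan(loss).any(), 'nan err msg'

loss.backward()
opt.step()
```

Checking the loss is cheap, but note it only tells you that a NaN appeared somewhere upstream; anomaly detection (or targeted checks on intermediate tensors) is still needed to locate its origin.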