Hi.
When I run my RL project, it produces NaN values (the error below) after a few iterations, even though I clip the gradients of my model with:
torch.nn.utils.clip_grad_norm_(self.critic_local1.parameters(), max_norm=4)
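For context, this is roughly where the clipping call sits in my update step (the optimizer and loss names here are just placeholders, not necessarily the exact ones in my code):

```python
self.critic_optimizer1.zero_grad()
critic_loss.backward()
# clip after backward() so the gradients exist, and before step()
torch.nn.utils.clip_grad_norm_(self.critic_local1.parameters(), max_norm=4)
self.critic_optimizer1.step()
```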
The Error:
*ValueError: Expected parameter probs (Tensor of shape (1, 45)) of distribution Categorical(probs: torch.Size([1, 45])) to satisfy the constraint Simplex(), but found invalid values:*
*tensor([[nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan]], grad_fn=<DivBackward0>)*
So I used torch.autograd.set_detect_anomaly(True) to locate where the anomaly occurs, and it reports:
Function 'MkldnnRnnLayerBackward0' returned nan values in its 1th output
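Since the message points at the recurrent layer, I am planning to add a forward hook like the sketch below to check whether the NaNs already appear in the forward pass (self.critic_local1.rnn here is a placeholder for however the recurrent layer is actually named in my model):

```python
import torch

def nan_check_hook(module, inputs, output):
    # LSTM/GRU forward returns a tuple (output, hidden); check the output tensor
    out = output[0] if isinstance(output, tuple) else output
    if torch.isnan(out).any():
        raise RuntimeError(f"NaN in forward output of {module.__class__.__name__}")

# placeholder attribute name -- replace with the actual recurrent module
self.critic_local1.rnn.register_forward_hook(nan_check_hook)
```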
I could not find any explanation of this error or of what MkldnnRnn is, nor of what the root cause of the NaN values might be. I thought the NaN problem should be solved by clipping the gradients.
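If I understand correctly, though, clipping only rescales the gradients, so if they are already NaN it cannot repair them. A minimal standalone sketch of what I mean:

```python
import torch

p = torch.nn.Parameter(torch.ones(3))
p.grad = torch.tensor([float("nan"), 1.0, 2.0])

total_norm = torch.nn.utils.clip_grad_norm_([p], max_norm=4)
print(total_norm)  # nan: the norm of a NaN gradient is itself NaN
print(p.grad)      # still contains nan; clipping does not repair it
```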
What also confuses me is that the code runs without errors on my laptop, but it raises this error when executed on the server. I don't believe this is related to package versions.
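Since the backward function name mentions Mkldnn, I suspect the server might be taking the MKL-DNN (oneDNN) code path for the RNN while my laptop does not. I was going to compare the two environments with something like this (just a sketch):

```python
import torch

print(torch.__version__)
print(torch.backends.mkldnn.is_available())  # whether the MKL-DNN / oneDNN backend is available
```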