Torch 1.6.0 RuntimeError: probability tensor contains either `inf`, `nan` or element < 0, but works with Torch 1.1.0

Shizhe_Cai · August 30, 2020, 10:14am

I’m running this piece of code:

    def forward(self, inputs, core_state=()):
        x = inputs["frame"]
        # time x batch x 64 x 64 x 3
        T, B, *_ = x.shape

        # merge time and batch
        # [T*B x 64 x 64 x 3]
        x = torch.flatten(x, 0, 1)

        x = x.float()

        # [T*B x 3 x 64 x 64]
        x = x.transpose(1, 3)

        # x = checkpoint_sequential(self.feat_extract, 2, x)
        x = self.feat_extract(x)
        x = x.view(T*B, -1)

        # core_input = checkpoint_sequential(self.fc, 2, x)
        core_input = self.fc(x)

        core_output = core_input
        core_state = tuple()

        policy_logits = self.policy(core_output)
        baseline = self.baseline(core_output)
        if self.training:
            action = torch.multinomial(F.softmax(policy_logits, dim=1), num_samples=1)
        else:
            action = torch.argmax(policy_logits, dim=1)

        policy_logits = policy_logits.view(T, B, self.num_actions)
        baseline = baseline.view(T, B)
        action = action.view(T, B)

        return dict(policy_logits=policy_logits, baseline=baseline,
                    action=action), core_state

And got this error:

Traceback (most recent call last):
  File "/usr/local/easybuild-2019/easybuild/software/compiler/gcccore/8.3.0/python/3.7.4/lib/python3$
    self.run()
  File "/usr/local/easybuild-2019/easybuild/software/compiler/gcccore/8.3.0/python/3.7.4/lib/python3$
    self._target(*self._args, **self._kwargs)
  File "/data/gpfs/projects/punim1126/CL-RIDE-master-project/src/utils.py", line 335, in act
    raise e
  File "/data/gpfs/projects/punim1126/CL-RIDE-master-project/src/utils.py", line 287, in act
    agent_output, agent_state = model(env_output, agent_state)
  File "/home/shizhec/.local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _$
    result = self.forward(*input, **kwargs)
  File "/data/gpfs/projects/punim1126/CL-RIDE-master-project/src/models.py", line 675, in forward
    action = torch.multinomial(F.softmax(policy_logits, dim=1), num_samples=1)
RuntimeError: probability tensor contains either inf, nan or element < 0

I’m using torch 1.6.0. However, when I use torch 1.1.0, I don’t get any error and can train the model correctly. Does anyone know why this is happening?

iffiX · August 30, 2020, 2:29pm

What is the output of policy_logits?

Shizhe_Cai · August 30, 2020, 2:58pm

# One output before the error:
tensor([[ 4.0024, -22.5107, 8.6548, -26.3529, 199.5710, -23.9216, -40.0046,
          3.1537, -19.8343, -39.2633, 34.4721, -7.2311, -56.9415, -5.0400,
          15.7165]])
# Where the error occurs:
tensor([[nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan]])
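The all-NaN logits above turn into an all-NaN softmax output, which is what `torch.multinomial` rejects. A minimal sketch of a guard that fails before sampling and reports which rows are affected. The helper name `sample_action` is hypothetical and not part of the original model, and the `good` tensor reuses a few of the logits posted above:

```python
import torch
import torch.nn.functional as F


def sample_action(policy_logits: torch.Tensor) -> torch.Tensor:
    """Hypothetical helper: sample one action per row, failing early on bad logits."""
    finite = torch.isfinite(policy_logits)
    if not finite.all():
        # Report which batch rows contain NaN or inf before multinomial sees them.
        bad_rows = (~finite).any(dim=1).nonzero().flatten()
        raise ValueError(f"non-finite policy_logits in rows {bad_rows.tolist()}")
    probs = F.softmax(policy_logits, dim=1)
    return torch.multinomial(probs, num_samples=1)


good = torch.tensor([[4.0024, -22.5107, 8.6548, 199.5710, -40.0046]])
bad = torch.full((1, 5), float("nan"))

action = sample_action(good)

try:
    sample_action(bad)
    raised = False
except ValueError:
    raised = True
```

This only moves the failure to a more informative place. It does not explain where the NaN came from.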

Shizhe_Cai · August 30, 2020, 3:10pm

With torch==1.1.0, I never get those NaN tensors and the whole process works fine. This is strange.

iffiX · August 31, 2020, 2:03am

Your code seems OK. Something might have changed between these two versions, so you can either:

1. Modify your framework, then track down the origin of the first NaN value.
2. Switch back to 1.1.0 for now.

I would recommend the second option, given the complex situation you described in your previous question.
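For the first option, one possible way to find where the first NaN appears in the forward pass is to attach forward hooks to every submodule. This is only a sketch: `add_nan_hooks` is a hypothetical helper, and the `Sqrt` layer and the `nn.Sequential` model are toy stand-ins, not the model from the question.

```python
import torch
import torch.nn as nn


def add_nan_hooks(model: nn.Module) -> list:
    """Hypothetical helper: raise as soon as any submodule outputs NaN or inf."""
    def make_hook(name):
        def hook(module, inputs, output):
            if isinstance(output, torch.Tensor) and not torch.isfinite(output).all():
                raise RuntimeError(f"non-finite output first seen in module '{name}'")
        return hook

    # Skip the root module (empty name) so the report points at a leaf.
    return [m.register_forward_hook(make_hook(n))
            for n, m in model.named_modules() if n]


class Sqrt(nn.Module):
    # Toy layer standing in for whatever operation produces the NaN.
    def forward(self, x):
        return torch.sqrt(x)  # NaN for negative inputs


model = nn.Sequential(nn.Linear(4, 4), Sqrt(), nn.Linear(4, 2))
handles = add_nan_hooks(model)

# Force the first Linear layer to output -1 everywhere so Sqrt yields NaN.
with torch.no_grad():
    model[0].weight.fill_(0.0)
    model[0].bias.fill_(-1.0)

try:
    model(torch.ones(1, 4))
    message = ""
except RuntimeError as err:
    message = str(err)

# Remove the hooks once debugging is done.
for h in handles:
    h.remove()
```

Hooks like these only cover the forward pass. For NaNs produced during backpropagation, `torch.autograd.set_detect_anomaly(True)` is the built-in option. If every logit is NaN at once, it may also be worth checking whether the parameters themselves became NaN after an optimizer step.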

iffiX · August 31, 2020, 2:05am

NaN can appear almost anywhere, but the main sources are:

- division by 0
- anything involving the annoying log/exp calculations, such as log probabilities. I located a NaN problem of my own just this morning, e.g.:

    from torch.distributions import Normal
    a = Normal(1.0, 1e-23)
    a.log_prob(a.sample())  # -> NaN because sigma is too small
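A common mitigation for this kind of NaN is to put a floor under sigma before building the distribution. The sketch below assumes a floor of `1e-6`. `MIN_SIGMA` is an illustrative name and value, not something from the thread. In float32 the variance `sigma ** 2` most likely underflows to zero here, so the log-density ends up computing 0/0.

```python
import torch
from torch.distributions import Normal

MIN_SIGMA = 1e-6  # assumed floor; the right value depends on the problem

mu = torch.tensor(1.0)
sigma = torch.tensor(1e-23)

unsafe = Normal(mu, sigma)
safe = Normal(mu, sigma.clamp(min=MIN_SIGMA))  # sigma can no longer collapse to ~0

x = torch.tensor(1.0)
unsafe_lp = unsafe.log_prob(x)  # squared error and variance are both zero in float32
safe_lp = safe.log_prob(x)      # variance is 1e-12, so the division is well defined

print(torch.isnan(unsafe_lp).item(), torch.isfinite(safe_lp).item())
```

Clamping changes the distribution whenever the floor is active, so it hides the symptom without explaining why sigma collapsed in the first place.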

Aayush_Shah · September 28, 2023, 9:57am

Have you tried increasing the temperature?

Try increasing the temperature value. I had a very low temperature along with other parameters such as top_k and top_p, which made the next-token distribution too steep. Beam search needs multiple candidate tokens to be available, and with a low temperature I didn't have them (because we know how temperature works, right?).

So I increased the temperature and it worked.

Try increasing the temp value and it should just work, if no other complexity is involved.
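To make the temperature effect concrete, here is a minimal sketch in plain Python. It assumes temperature is applied by dividing the logits before the softmax, which is the usual convention. The function name `softmax_with_temperature` and the logits are made up for illustration.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Softmax of logits / temperature, shifted by the max for numerical stability."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5, -1.0]  # made-up values for illustration

# Count how many tokens keep a non-negligible probability at each temperature.
usable = {}
for t in (0.05, 1.0, 5.0):
    probs = softmax_with_temperature(logits, t)
    usable[t] = sum(p > 1e-6 for p in probs)
    print(f"T={t}: {usable[t]} tokens above 1e-6, max prob {max(probs):.3f}")
```

At a very low temperature almost all of the probability mass lands on a single token, so a search procedure that needs several candidates has nothing else to pick from.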