[solved] Possible reasons for nans in 0.4.0?

I’m running a convnet, and getting nans. However, there are no obvious divisions, so it’s unclear to me how the nans could be forming. As far as I understand it, there are two ways to obtain a nan result:

  • divide 0 by 0. (dividing non-zero by 0 gives inf; dividing 0 by non-zero gives 0; dividing 0 by 0 gives nan)
  • the result of pretty much any function for which any of the inputs is nan (a quick sanity check of both cases is sketched just below)
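
A quick sketch of both cases (illustrative values only, nothing from my actual code):

import torch

zero = torch.zeros(1)
print(torch.ones(1) / zero)  # non-zero / 0 gives tensor([inf])
print(zero / zero)           # 0 / 0 gives tensor([nan])
print((zero / zero) + 1.0)   # nan propagates through further ops: tensor([nan])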

I have the following statement in my code:

qtargets = qtargets + p.discount_factor * qmax_next * qvalues_next_mask

Yes, it’s an RL thing.

The result of this flows into a loss criterion. When the loss is nan, I print out a bunch of diagnostic information, including:

 print('qtargets is nan? ', math.isnan(qtargets.sum().item()))
 print('qvalues_next_mask is nan? ', math.isnan(qvalues_next_mask.sum().item()))
 print('qvalues_next is nan? ', math.isnan(qvalues_next.sum().item()))
 print('qmax_next is nan? ', math.isnan(qmax_next.sum().item()))
 df_qm = p.discount_factor * qmax_next
 print('df_qm is nan? ', math.isnan(df_qm.sum().item()))
 qm_qvnm = qmax_next * qvalues_next_mask
 print('qm_qvnm is nan? ', math.isnan(qm_qvnm.sum().item()))
 df_qm_qvnm = p.discount_factor * qmax_next * qvalues_next_mask
 print('df_qm_qvnm is nan? ', math.isnan(df_qm_qvnm.sum().item()))

The results of this are somewhat stochastic, but include, for example:

qtargets is nan?  True
qvalues_next_mask is nan?  False
qvalues_next is nan?  False
qmax_next is nan?  False
df_qm is nan?  False
qm_qvnm is nan?  True
df_qm_qvnm is nan?  True

p.discount_factor is a scalar float (=0.5). These tensors are all torch CUDA tensors.

How can qmax_next * qvalues_next_mask be nan, when they themselves are each non-nan?
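
(A possible explanation, sketched with made-up values: a sum-based check does not flag infs, since inf sums to inf rather than nan, but inf * 0 is nan, so a product can test as nan even when neither factor does.)

import math
import torch

a = torch.tensor([1.0, float('inf')])
b = torch.tensor([1.0, 0.0])
print(math.isnan(a.sum().item()))        # False: the sum is inf, not nan
print(math.isnan(b.sum().item()))        # False
print(math.isnan((a * b).sum().item()))  # True: inf * 0 = nan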

OK, I finally tracked down the bug, so I'm marking this solved. (I'm not sure why some of the tensors above showed as not nan, but that's now academic, since fixing the bug means the problem above no longer manifests. The bug was that I had a tensor where some rows had data and the other rows were undefined. I then multiplied this by a binary mask and added it to another tensor. However, zero times an undefined value gives nan whenever the undefined value is nan…)
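
A minimal sketch of the bug pattern (illustrative names and sizes, not my actual code):

import torch

qmax_next = torch.empty(4)                 # contents undefined
qmax_next[:2] = torch.tensor([0.3, 0.7])   # only some rows get real data
qmax_next[2:] = float('nan')               # simulate unlucky garbage in the unwritten rows

mask = torch.tensor([1.0, 1.0, 0.0, 0.0])  # meant to zero out the undefined rows

print(qmax_next * mask)  # tensor([0.3000, 0.7000, nan, nan]): nan * 0 is nan, not 0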

What do you mean when you say the other rows were undefined?

E.g. if you create the tensor using torch.empty, then the contents of the cells are undefined. Undefined can be anything, including nans.

(So, ever since, I just always use torch.zeros, instead of torch.empty, to create tensors.)
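
To make the difference concrete (the exact contents of torch.empty vary from run to run):

import torch

print(torch.empty(3))  # uninitialized memory: could be anything, possibly including nan or inf
print(torch.zeros(3))  # always tensor([0., 0., 0.])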

In case you landed here because of nan values in tensors but without using torch.empty():

In my case the nan got introduced by a torch.nn.functional.softmax(). The softmax produces nans when every element along the softmax dimension is -inf (in my case I mask out values by setting them to -float('inf')).

With a dimension of size 1, a single -inf entry is enough to trigger this. Example:

import torch

softmax = torch.nn.functional.softmax(torch.tensor([[0.7], [-float('inf')], [0.5]]), dim=-1)
print(softmax)
# tensor([[1.],
#         [nan],
#         [1.]])

Solving it is rather simple: either use a large negative but finite value for masking, or just don't apply a softmax if your sequence has a length of 1.
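
A sketch of the first option, using made-up attention-style scores and an arbitrary finite sentinel of -1e9:

import torch
import torch.nn.functional as F

scores = torch.tensor([[0.7, 0.5], [0.2, 0.1]])
mask = torch.tensor([[False, True], [True, True]])  # second row is fully masked

# With -inf masking, a fully masked row becomes all nan:
probs_inf = F.softmax(scores.masked_fill(mask, -float('inf')), dim=-1)
# tensor([[1., 0.],
#         [nan, nan]])

# With a large finite sentinel, the masked row degrades to uniform but stays nan-free:
probs_fin = F.softmax(scores.masked_fill(mask, -1e9), dim=-1)
# tensor([[1.0000, 0.0000],
#         [0.5000, 0.5000]])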