Assertion `input_val >= zero && input_val <= one` failed

It looks like the parameters in my convolutional kernels have gone to either 0 or NaN, e.g.,

tensor([ nan, nan, nan, nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan, nan, nan, nan,
-0.0745, nan, nan, nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan, nan, nan, nan,
nan, -0.2387, -0.0190, -0.0532, nan, nan, nan, nan,
nan, nan, nan, -0.0104, nan, nan, nan, nan,
0.0650, nan, nan, nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan, nan, nan, nan,
nan, nan, nan, -0.0849, nan, nan, nan, nan,
nan, nan, nan, nan, nan, -0.0302, nan, nan,
nan, nan, nan, nan, nan, nan, nan, -0.0568,
nan, nan, nan, nan, nan, nan, nan, nan],
device='cuda:0', grad_fn=),

Is this caused by the dataset being too small?

No, the size of the dataset should not cause invalid values.
NaN values in parameters are most likely caused by NaNs in the gradient, which might be caused by an exploding loss or e.g. invalid input values.
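A quick way to confirm where the NaNs first appear is to check the inputs, the loss, and the gradients for non-finite values during training. A minimal sketch, assuming a generic model and training loop (the helper name is illustrative, not from the original code):

import torch

def check_finite(model, inputs, loss):
    # Hypothetical debugging helper: report non-finite values in the
    # input batch, the loss, and the gradients after loss.backward().
    if not torch.isfinite(inputs).all():
        print("non-finite values in the input batch")
    if not torch.isfinite(loss):
        print("non-finite loss:", loss.item())
    for name, param in model.named_parameters():
        if param.grad is not None and not torch.isfinite(param.grad).all():
            print(f"non-finite gradient in {name}")

# usage in the training loop, after loss.backward() and before optimizer.step():
# check_finite(model, data, loss)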


The solution seems to be to reduce the learning rates. So far, with the same dataset, it is at 73 epochs and rising…


I was writing my solution as you were responding…

By reducing the learning rates the gradients should be more manageable…
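For completeness, a minimal sketch of that kind of change (the optimizer and learning-rate values are illustrative, not taken from the original code):

import torch
import torch.nn as nn

model = nn.Conv2d(3, 16, kernel_size=3)  # placeholder model for illustration

# Before: a large learning rate let the gradients explode and the kernels turn to NaN
# optimizer = torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)

# After: a smaller learning rate keeps the updates (and therefore the gradients) manageable
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4, momentum=0.9)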


Hi, in my code the problem also did not happen in the first epoch, and none of the answers about out-of-bound labels or missing batch norm seem to apply, since I'm sure I use batch norm after every conv layer. Are you sure the problem comes from the CPU? This problem confuses me a lot… Thank you.

Hi @ptrblck, I've run into a similar issue, except it is seldom reproducible. The model I made runs training and eval as expected, but there was one iteration during training which caused this assertion error after 4 epochs. I pinpointed the exact iteration where the error occurs, but that iteration has since gone back to normal. I am using CUDA. Here is the core snippet where the assertion occurs…

import torch
import torch.nn as nn
import torch.nn.functional as F

def ava_pose_softmax_func(logits):
    # softmax over the 13 pose classes, sigmoid over the interaction classes
    pose_logits = nn.Softmax(dim=1)(logits[:, :13])
    interact_logits = nn.Sigmoid()(logits[:, 13:])
    logits = torch.cat([pose_logits, interact_logits], dim=1)
    logits = torch.clamp(logits, min=0., max=1.)
    return logits

def ava_pose_softmax_criterion(logits, targets):
    logits = ava_pose_softmax_func(logits)
    return F.binary_cross_entropy(logits, targets)

I've tried using anomaly detection on both the forward and backward passes, and it redirected me to the following line in my NN:

roi_slow_feats = [
    nn.AdaptiveAvgPool3d((1, self.roi_spatial, self.roi_spatial))(roi_slow_feats[idx, :, s_mask])
    for s_mask, idx in zip(roi_slow_feats_nonzero, range(len(roi_slow_feats_nonzero)))
]

roi_slow_feats is the only term with a gradient; this list is then stacked using torch.stack. Sorry, this may be unclear, so please reach out for whatever info you need. Could it be a multiprocessing issue?
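For reference, the anomaly detection mentioned above is typically enabled like this (a minimal sketch; model, criterion, data, and targets are placeholders):

import torch

# Makes autograd re-run the failing backward op with a traceback pointing
# to the forward op that produced NaN/Inf. It slows training down, so it
# is usually enabled only while debugging.
torch.autograd.set_detect_anomaly(True)

# output = model(data)
# loss = criterion(output, targets)
# loss.backward()  # raises at the op that produced the invalid value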

I'm unsure how the pooling layer would be related to the error (or multiprocessing), as the error is raised in the criterion, which fails since the target has out-of-bounds values.
Add debugging print statements or asserts and check when the targets tensor has invalid values.
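A minimal sketch of such a check, placed right before the loss computation (the helper name is illustrative; the variable names mirror the snippet above):

import torch

def check_targets(targets):
    # F.binary_cross_entropy expects every target to lie in [0, 1]
    assert torch.isfinite(targets).all(), "targets contain NaN/Inf"
    assert targets.min() >= 0. and targets.max() <= 1., (
        f"targets out of [0, 1]: min={targets.min()}, max={targets.max()}"
    )

# inside the criterion:
# check_targets(targets)
# loss = F.binary_cross_entropy(logits, targets)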

Ah, so this assertion error is raised from targets and not logits? Could it be possible that a NaN exists in the logits?

I’ve added an assertion on targets. Here’s how they are created:

if self.multi_class:
    # multi-hot target: write a 1 at every class index listed in label['label']
    ret = torch.zeros(self.num_classes)
    ret.put_(torch.LongTensor(label['label']),
             torch.ones(len(label['label'])))
else:
    ret = torch.LongTensor(label['label'])

targets.append(ret)

label has, for example, the following structure:

{'tube_uid': '50aaa5a7', 'bounding_box': [0.4397682428483692, 0.42840391506329534, 0.45544815831916613, 0.4660857148557656], 'label': [3]}

These targets are then stacked with torch.stack.

You might be right that the labels are checked while the target could be invalid and might create e.g. a negative loss.
I didn't realize you are using the functional API of nn.BCELoss and not nn.BCEWithLogitsLoss.

My bad, yeah, we import torch.nn.functional as F. The dataset this NN is built for requires us to split its predictions into a softmax portion and a sigmoid portion. In the future we may only need the sigmoid, so do you think implementing nn.BCEWithLogitsLoss may make training more stable? (Granted, if all the targets are valid.)

nn.BCEWithLogitsLoss or the functional form via F.binary_cross_entropy_with_logits would give you more numerical stability. I don't know what's causing the issue currently, but it might be worth switching the criterion and passing the raw logits to it.
Note that your approach isn't wrong in itself. I just didn't pay attention to the criterion name and assumed you were using F.binary_cross_entropy_with_logits, so I was stressing the targets (as the logits are unbounded) :wink:
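If only the sigmoid portion is needed later on, the criterion could take the raw logits directly, e.g. (a minimal sketch of the suggested switch; the function name is illustrative):

import torch.nn.functional as F

def ava_interact_criterion_with_logits(logits, targets):
    # Sigmoid and BCE are fused internally (via the log-sum-exp trick),
    # so no clamping is needed and the loss stays finite even for
    # large-magnitude logits.
    return F.binary_cross_entropy_with_logits(logits, targets)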


Great! Thanks for the help. Sorry about being vague, I came across this assertion error during training and now it’s gone.

Added an assert on targets (and a check to catch the assertion from binary_cross_entropy); I'll let you know if any progress is made on this issue. One thing at the back of my mind: if my targets are created by the same function throughout training, is it possible for them to become invalid at epoch 5 after being fine in the first 4 epochs?

No, that shouldn't be the case, and I also think you were right that the inputs (i.e. the model outputs) contain the invalid values (not the targets).


And invalid model outputs can be caused by a plethora of issues… However, the fact that the problem occurred at epoch 5 could mean that this is a learning issue, possibly exploding gradients (see the clipping sketch below).

Will have to see what the logits/targets are once I catch an assertion error in the criterion.
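If exploding gradients do turn out to be the cause, clipping the gradient norm between backward() and step() is a common mitigation. A minimal sketch with placeholder model and data (the max_norm value is illustrative):

import torch
import torch.nn as nn
import torch.nn.functional as F

model = nn.Linear(10, 1)  # placeholder model for illustration
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

data = torch.randn(4, 10)
targets = torch.rand(4, 1)

loss = F.binary_cross_entropy_with_logits(model(data), targets)
loss.backward()

# Rescale the gradient norm before the update so a single bad batch
# cannot push the parameters to NaN.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()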