Exception in BCECriterion.cu:42

cherepanovic · April 14, 2020, 3:05pm

Is anyone familiar with this exception? It was raised after the 70th epoch. (It does not depend on the epoch: I ran it again and got the exception after the 40th epoch.)

Train Epoch: 70 [0/6742 (0%)]   Loss: -431231.800000
/opt/conda/conda-bld/pytorch_1579022060824/work/aten/src/THCUNN/BCECriterion.cu:42: Acctype bce_functor<Dtype, Acctype>::operator()(Tuple) [with Tuple = thrust::detail::tuple_of_iterator_references<thrust::device_reference<float>, thrust::device_reference<float>, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type>, Dtype = float, Acctype = float]: block: [4,0,0], thread: [224,0,0] Assertion `input >= 0. && input <= 1.` failed.

Thanks

kshitij · April 14, 2020, 3:30pm

It looks like at some point in training the model outputs a value outside the range 0 to 1. Are you sure that the model's predicted values are in the range 0 to 1?

Can you add a try-except to print the model's predictions when the error is raised?

cherepanovic · April 14, 2020, 7:28pm

I used the MNIST dataset. The values are between 0 and 255 (inclusive).

The predicted values look like this (a snippet):

    0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00,
    5.4309e-15, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00,
    0.0000e+00, 7.7825e-17, 1.0000e+00, 1.0000e+00, 1.0000e+00, 1.0000e+00,
    1.0000e+00, 1.0000e+00, 1.0000e+00, 1.0000e+00, 1.0000e+00, 1.0000e+00,
    1.0000e+00, 1.0000e+00, 1.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00,
    0.0000e+00, 2.2867e-38, 1.7363e-37, 9.8193e-22, 9.3205e-27, 0.0000e+00,
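One thing that may be worth checking here, offered as an assumption rather than a confirmed diagnosis: if the raw 0 to 255 pixel values are also used as the target, the loss can become negative, which would be consistent with the `Loss: -431231.800000` line in the log. The sketch below uses plain Python and the textbook per-element formula (the helper `bce_elementwise` is hypothetical and ignores the clamping PyTorch applies to the log terms) to compare a scaled and an unscaled target.

```python
import math

def bce_elementwise(p: float, t: float) -> float:
    """Textbook per-element binary cross entropy: -(t*log(p) + (1-t)*log(1-p))."""
    return -(t * math.log(p) + (1.0 - t) * math.log(1.0 - p))

p = 0.9  # a prediction strictly inside (0, 1)

# Target scaled into [0, 1]: the loss is non-negative.
loss_scaled = bce_elementwise(p, 255 / 255.0)

# Raw pixel value used as target: (1 - t) is negative, so the loss can go below zero.
loss_raw = bce_elementwise(p, 255.0)

print(loss_scaled)
print(loss_raw)
```

If that turns out to be the case, dividing the targets by 255 (or using a transform that already yields values in [0, 1]) keeps the loss bounded below by zero.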

cherepanovic · April 14, 2020, 7:49pm

You are right, my loss function is F.binary_cross_entropy.

kshitij · April 14, 2020, 8:30pm

The model's predicted values should be between 0 and 1.

import torch
import torch.nn.functional as F
preds = torch.ones(3) + 0.01
target = torch.ones(3)
F.binary_cross_entropy(preds, target) # Crashes as preds is not between 0 and 1.

I suspect this is what is happening in the epoch where it crashes, i.e. preds is not in the range 0 to 1.

For the epoch in which it crashes, can you do the following:

try:
    ...  # training step that computes preds and the loss
except RuntimeError:
    # Element-wise OR: True wherever a prediction is outside [0, 1].
    mask = (preds > 1) | (preds < 0)
    print(mask.sum())  # If non-zero, predictions are out of range.
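One caveat with a comparison-based mask: NaN compares False against everything, so a NaN prediction is not counted by `preds > 1` or `preds < 0`, even though it would still fail the `input >= 0. && input <= 1.` assertion. A minimal sketch of a check that reports both cases separately (the helper name `count_invalid` and the sample values are made up for illustration):

```python
import torch

def count_invalid(preds: torch.Tensor) -> dict:
    """Count NaN entries and entries outside [0, 1] separately."""
    nan_mask = torch.isnan(preds)
    # NaN entries evaluate to False in both comparisons, so they are not double counted.
    range_mask = (preds > 1) | (preds < 0)
    return {"nan": int(nan_mask.sum()), "out_of_range": int(range_mask.sum())}

# Illustrative values: one valid, one too large, one NaN, one negative.
preds = torch.tensor([0.2, 1.01, float("nan"), -0.5])
counts = count_invalid(preds)
print(counts)
```

Running this on the CPU copy of the predictions (for example `preds.detach().cpu()`) before the loss call may be easier than inspecting them after a CUDA assertion has already fired.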

You have most probably done this already, but do verify that you apply sigmoid to the preds before passing them to the loss function.

Or you can use https://pytorch.org/docs/stable/nn.functional.html#binary-cross-entropy-with-logits
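As a rough illustration of how the two functions relate: `F.binary_cross_entropy_with_logits` expects raw scores (no sigmoid applied) and applies the sigmoid internally. The tensors below are made-up values, not taken from the model in this thread.

```python
import torch
import torch.nn.functional as F

# Raw, unbounded model outputs (logits) and binary targets.
logits = torch.tensor([-2.0, 0.5, 3.0])
target = torch.tensor([0.0, 1.0, 1.0])

# Variant 1: apply sigmoid explicitly, then BCE on probabilities.
loss_probs = F.binary_cross_entropy(torch.sigmoid(logits), target)

# Variant 2: pass the logits directly; the sigmoid is applied inside the loss.
loss_logits = F.binary_cross_entropy_with_logits(logits, target)

print(loss_probs.item(), loss_logits.item())
```

Note that when switching to the logits variant, the model's final sigmoid has to be removed, otherwise the sigmoid is effectively applied twice.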

Hope this helps.

cherepanovic · April 14, 2020, 9:50pm

Or you can use https://pytorch.org/docs/stable/nn.functional.html#binary-cross-entropy-with-logits

That did not help; the loss function returned NaN:

tensor(nan, device='cuda:0', grad_fn=<BinaryCrossEntropyWithLogitsBackward>)

cherepanovic · April 14, 2020, 9:59pm

F.binary_cross_entropy(preds, target) # Crashes as preds is not between 0 and 1.

That is not the same crash: it fails at BCECriterion.c:62 (in my case it was BCECriterion.cu:42), but both point to invalid input.

thank you!

cherepanovic · April 14, 2020, 11:28pm

The funny thing is that the last layer of my model is already a sigmoid function:

return F.sigmoid(self.fc6(h))

that means that the results should be within the required range
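A possible explanation, stated as an assumption: a sigmoid keeps finite inputs inside [0, 1], but it passes NaN straight through, and NaN does not satisfy `input >= 0. && input <= 1.`. The sketch below uses a small stand-in layer (not the `fc6` from the model above) with one weight deliberately set to NaN to show the effect.

```python
import torch

# Hypothetical stand-in for a final linear layer.
fc = torch.nn.Linear(4, 2)

# Simulate a weight that has diverged to NaN during training.
with torch.no_grad():
    fc.weight[0, 0] = float("nan")

out = torch.sigmoid(fc(torch.ones(1, 4)))

# The same condition the BCE kernel asserts on, evaluated element-wise.
passes_assert = (out >= 0) & (out <= 1)

print(out)
print(passes_assert)
```

Only the output fed by the NaN weight fails the check; the other output is still a valid probability. Checking `torch.isfinite(p).all()` for each `p` in `model.parameters()` is one way to see whether the weights themselves have become non-finite.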

kshitij · April 15, 2020, 8:16am

Surprising to see the error.

cherepanovic · April 15, 2020, 9:01pm

(btw) I tried the same execution with smaller learning rates 1e-4/1e-5/1e-6 over 150 iterations and didn’t get any errors.

Still waiting for help regarding this issue.

(I cross-posted this on GitHub: https://github.com/pytorch/pytorch/issues/36647)
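The observation that smaller learning rates avoid the error would be consistent with training diverging at the larger rate, although that is not confirmed in this thread. A minimal sketch of a training step that fails early with a readable message when predictions become non-finite, and clips gradients before the update. The model, data, learning rate, and `max_norm` value are all placeholders.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Hypothetical stand-in model ending in a sigmoid.
model = torch.nn.Sequential(torch.nn.Linear(8, 1), torch.nn.Sigmoid())
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)

def train_step(x: torch.Tensor, target: torch.Tensor, max_norm: float = 1.0) -> float:
    optimizer.zero_grad()
    preds = model(x)
    # Fail with a clear Python error instead of a device-side assertion.
    if not torch.isfinite(preds).all():
        raise RuntimeError("non-finite predictions; inspect weights and inputs")
    loss = F.binary_cross_entropy(preds, target)
    loss.backward()
    # Limit the gradient norm so that a single step cannot blow up the weights.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
    optimizer.step()
    return loss.item()

x = torch.rand(16, 8)
target = torch.randint(0, 2, (16, 1)).float()  # targets in {0, 1}
loss_value = train_step(x, target)
print(loss_value)
```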