Device-side assert triggered when using binary_cross_entropy loss

I got `RuntimeError: cudaEventSynchronize in future::wait device-side assert triggered` when I use `binary_cross_entropy`.

I think this is because the input of BCELoss must fall into the range [0, 1].

My input is the product of two softmax outputs, so in theory the product should never be greater than 1.

I think this may be related to floating-point precision?

If so, how can I solve this problem?

Can anyone help me? Thank you!

Here is my code:

cls_prob = F.softmax(cls_score, dim=1)   # softmax over classes
det_prob = F.softmax(det_score, dim=0)   # softmax over proposals
predict = torch.mul(cls_prob, det_prob)  # element-wise product
loss = F.binary_cross_entropy(predict, label, size_average=False)
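
I can detect the bad values with a check like this right before the loss (just a sketch, using the same `predict` and `label` tensors as above), but I am not sure what the right fix is:

# Sketch: inspect the BCE input range before computing the loss.
bad = (predict < 0.) | (predict > 1.)
if bad.any():
    print("values outside [0, 1]:", predict[bad])
loss = F.binary_cross_entropy(predict, label, size_average=False)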

Hi,
Can you run your script with CUDA_LAUNCH_BLOCKING=1 and see what error message is printed, please?
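
If it helps, one way to do that (just a sketch; running it from the shell as `CUDA_LAUNCH_BLOCKING=1 python your_script.py` works too, where the script name is whatever yours is called) is to set the variable in Python before anything initializes CUDA:

# Sketch: the variable must be set before the first CUDA call,
# so put this at the very top of your training script.
import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch  # imported only after the variable is set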

Sorry, I think I omitted some relevant code.
Here is my complete code:


import torch
from wsddn.roi_pooling.modules.roi_pool import RoIPool
from wsddn.utils.network import FC
from wsddn.utils import network
import torch.nn.functional as F
import torch.nn as nn
from wsddn.vgg16 import VGG16


class WSDDN(nn.Module):
    feature_scale = 1.0 / 16
    n_classes = 21

    def __init__(self, classes=None):
        super(WSDDN, self).__init__()
        if classes is not None:
            self.classes = classes
            self.n_classes = len(classes)

        self.features = VGG16()
        self.roi_pool = RoIPool(7, 7, self.feature_scale)
        self.fc6 = FC(512 * 7 * 7, 4096)
        self.fc7 = FC(4096, 4096)
        self.classifier_head = FC(4096, self.n_classes, relu=False)
        self.detection_head = FC(4096, self.n_classes, relu=False)

        self._loss = None
        self._detection = None

    def forward(self, im_data, rois, labels):
        im_data = network.np_to_variable(im_data, is_cuda=True)
        im_data = im_data.permute(0, 3, 1, 2)
        rois = network.np_to_variable(rois, is_cuda=True)
        labels = network.np_to_variable(labels, is_cuda=True)
        features = self.features(im_data)

        pooled_features = self.roi_pool(features, rois)
        x = pooled_features.view(pooled_features.size()[0], -1)
        x = self.fc6(x)
        x = F.dropout(x, training=self.training)
        x = self.fc7(x)
        x = F.dropout(x, training=self.training)
        cls_score = self.classifier_head(x)
        det_score = self.detection_head(x)

        cls_predict = F.softmax(cls_score, dim=1)      # softmax over classes
        det_predict = F.softmax(det_score, dim=0)      # softmax over proposals
        predict = torch.mul(cls_predict, det_predict)
        y_predict = predict.sum(dim=0)                 # per-class image-level scores
        y_predict = y_predict[1:]                      # drop the first class score

        self._loss = self.build_loss(y_predict, labels)
        self._detection = predict

        return y_predict

    @property
    def detection(self):
        return self._detection

    @property
    def loss(self):
        return self._loss

    def build_loss(self, y_predict, labels):
        loss = F.binary_cross_entropy(y_predict, labels, size_average=False)
        # y_predict = torch.clamp(y_predict, min=1e-4, max=1 - 1e-4)
        # loss = -1 * torch.log(labels * (y_predict - 1.0 / 2) + 1 / 2).sum()
        return loss

And the weird thing is that this problem only occurs after training for about 10,000 iterations, so I'm waiting for it to happen again now :joy:

Hi, this is the error message:

/pytorch/torch/lib/THCUNN/BCECriterion.cu:30: Acctype bce_functor<Dtype, Acctype>::operator()(Tuple) [with Tuple = thrust::tuple<float, float, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type>, Dtype = float, Acctype = float]: block: [0,0,0], thread: [14,0,0] Assertion `input >= 0. && input <= 1.` failed.
Traceback (most recent call last):
  File "/home/tz/projects/wsdnn_pytorch/train.py", line 86, in <module>
    predict = net(im_data, prior_boxes, gt_classes)
  File "/home/tz/anaconda2/envs/dl-python3/lib/python3.5/site-packages/torch/nn/modules/module.py", line 357, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/tz/projects/wsdnn_pytorch/wsddn/wsddn.py", line 53, in forward
    self._loss = self.build_loss(y_predict, labels)
  File "/home/tz/projects/wsdnn_pytorch/wsddn/wsddn.py", line 67, in build_loss
    loss = F.binary_cross_entropy(y_predict, labels, size_average=False)
  File "/home/tz/anaconda2/envs/dl-python3/lib/python3.5/site-packages/torch/nn/functional.py", line 1200, in binary_cross_entropy
    return torch._C._nn.binary_cross_entropy(input, target, weight, size_average)
RuntimeError: after cudaLaunch in triple_chevron_launcher::launch(): device-side assert triggered

Thank you for your help!

From the error message it seems that the input of your BCE loss is not between 0 and 1. The input you give should represent the probability of label 1, so it should be between 0 and 1.


I agree with you.
So can I simply use torch.clamp to restrict the input?
I think the reason the input doesn't fall into the range [0, 1] is floating-point precision.

If it is a floating-point precision error, then clamping will work, or you can shift by the minimum and divide by the maximum.
But first I would make sure that this really is a precision problem: apply the fix only if the values are close enough to 0 or 1, and raise an error otherwise.
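
Something along these lines, as a minimal sketch for your build_loss (the tolerance tol is just an assumed value you would tune):

# Sketch: clamp only when the overshoot is small enough to be a precision artifact.
tol = 1e-5  # assumed tolerance, not a recommended value
if (y_predict < -tol).any() or (y_predict > 1. + tol).any():
    raise RuntimeError("BCE input is outside [0, 1] by more than the tolerance")
y_predict = y_predict.clamp(0., 1.)
loss = F.binary_cross_entropy(y_predict, labels, size_average=False)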

Yes, thanks for your concrete reply!

Hi @albanD, I still see a similar issue in the newest PyTorch version (stable 1.4). I hope it will be fixed soon in the next PyTorch release.

This issue is due to user error (passing unexpected input to a function), not a bug on PyTorch's side.

@albanD No, I am doing it like this:

criterion = nn.BCELoss() 
pred = torch.sigmoid(pred)
loss = criterion(pred, target)

It still gives the error, but if I add a clamp, the error is resolved:

criterion = nn.BCELoss() 
pred = torch.clamp(torch.sigmoid(pred),0,1)
loss = criterion(pred, target)

This means the output of sigmoid is not in the range [0, 1], or maybe it is a precision problem. However, suppose I implement an attention module that uses sigmoid to produce values in the [0, 1] range; it would have the same problem if the result does not lie strictly within that range.

Could you please post some values for pred which produce this error?

@ptrblck Hello, I am sorry for the late reply. After debugging for several months, I finally found the main problem. The debugging took so long because the error only shows up occasionally, so I had to run the training again and again to catch it; it finally came back a few weeks ago. The problem is caused by NaN values in the prediction. That also explains why the error doesn't always happen: it depends on how the model behaves. The error says the value is not between 0 and 1, when in fact it is NaN. So I think it's better to detect NaN values before computing the loss; use torch.isnan to make sure the prediction is not NaN. I would also suggest that PyTorch report a NaN error instead of only saying the value is not between 0 and 1.
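
For example, a minimal check like this before the loss (a sketch; pred and target stand in for whatever tensors you feed into BCELoss):

# Sketch: fail fast with a clear message when the prediction contains NaNs.
if torch.isnan(pred).any():
    raise RuntimeError("prediction contains NaN values before BCELoss")
loss = criterion(pred, target)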


Thanks for the update. I like the suggestion about printing the actual invalid value.
Would you like to open a GitHub issue with this feature request?

I face the same problem as you. It takes a very long time to debug because it only happens now and then. Do you know why there could be NaN values in the prediction and how to prevent that from happening?

I am facing the same trouble as the original author posted. I multiply the results of two softmax outputs (softmax over two different dimensions), then sum the tensor over one dimension to get the final output scores, say a 20-d tensor. Below is the output score that triggers the CUDA assertion error, specifically the value 1.0000e+00, which in theory should not happen.

I assume this is related to floating-point precision error. The error is not stable to reproduce: I got it sometimes around 3k steps and sometimes after 10k steps during training.

Does it imply that we should clamp the tensor whenever we use binary_cross_entropy? I think it might be a good idea to log what value is actually causing the assertion error.

tensor([9.4490e-05, 1.3122e-06, 1.9130e-03, 1.1611e-04, 3.1499e-05, 7.9529e-05,
        5.0480e-05, 1.0000e+00, 2.0515e-04, 1.4706e-06, 3.1726e-05, 1.7213e-09,
        8.1568e-05, 6.2557e-06, 1.4758e-06, 2.2086e-04, 1.9921e-04, 7.1404e-05,
        6.8685e-06, 1.0655e-04], device='cuda:0', grad_fn=<SumBackward1>)
cls_prob = F.softmax(cls_score, dim=1)         # across classes                [2000, 20]
det_prob = F.softmax(det_score, dim=0)         # across proposals/detections   [2000, 20]
predict = torch.mul(cls_prob, det_prob)        # shape: [2000, 20]
pred_class_scores = torch.sum(predict, dim=0)  # [20]
loss = F.binary_cross_entropy(pred_class_scores, label, size_average=False)

Your code might create values larger than 1. due to the limited floating point precision as seen here:

import torch
import torch.nn.functional as F

torch.manual_seed(8)

cls_score = torch.randn(2000, 20, device='cuda')
cls_score[:, 19] = 100.
det_score = torch.randn(2000, 20, device='cuda')

cls_prob = F.softmax(cls_score, dim=1) # across classes [2000,20]
det_prob = F.softmax(det_score, dim=0) # across proposals/detections [2000,20]
predict = torch.mul(cls_prob, det_prob) # shape: [2000,20]
pred_class_scores = torch.sum(predict, dim=0) # [20]
print((pred_class_scores > 1.).any())
> tensor(True, device='cuda:0')

print(pred_class_scores[19] - 1.)
> tensor(1.1921e-07, device='cuda:0')

so I think you should clamp the values before passing them to the loss function.
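
For example, continuing the snippet above (a sketch; label is assumed to be the multi-label target for the 20 classes):

# Sketch: clamp away the tiny overshoot before computing the loss.
pred_class_scores = pred_class_scores.clamp(0., 1.)
loss = F.binary_cross_entropy(pred_class_scores, label, size_average=False)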


Thank you for creating the example! It helps me a lot.

Hi, I met the same problem as posted here. I checked the values of the results after multiplying the two scores (computed by softmax), and sometimes it did give values larger than 1. It seems to truly be a precision problem.

I checked the solution of a GitHub repo (https://github.com/NVlabs/wetectron/blob/master/wetectron/modeling/roi_heads/weak_head/loss.py). The solution proposed in this repo is simply clamping the scores. I think clamping the values will cause zero gradients for the clamped elements during backpropagation, but it seems there is no other solution right now.
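
A quick way to see that effect (a minimal sketch, independent of the repo above):

import torch

# Sketch: clamp passes gradients through for in-range values and zeroes them for clamped ones.
x = torch.tensor([0.5, 1.0 + 1e-6], requires_grad=True)
y = torch.clamp(x, 0., 1.)
y.sum().backward()
print(x.grad)  # tensor([1., 0.]) -- the clamped element gets zero gradient

In the precision case only the few elements that overshoot 1 by around 1e-7 get clamped, so the zeroed gradients should affect very little of the computation.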