I’m training a network that consists of convolutional layers and, at the end, outputs a single binary value (1.0 = true, 0.0 = false). I’m training with mixed precision, with Linear layers at the end, BCEWithLogitsLoss, and the Adam optimizer, on PyTorch 1.12.1.
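For context, my training step looks roughly like this (simplified; the model, sizes, and names here are placeholders, and autocast/GradScaler are disabled so the snippet also runs on CPU):

```python
import torch
import torch.nn as nn

# Placeholder model standing in for the real conv + linear network
model = nn.Sequential(nn.Flatten(), nn.Linear(8, 1))
opt = torch.optim.Adam(model.parameters())
criterion = nn.BCEWithLogitsLoss()
# enabled=False so this sketch runs on CPU; the real training uses enabled=True on GPU
scaler = torch.cuda.amp.GradScaler(enabled=False)

inputs = torch.randn(4, 8)
labels = torch.randint(0, 2, (4, 1)).float()  # 0.0 / 1.0 targets as float, not bool

with torch.autocast(device_type="cpu", enabled=False):
    logits = model(inputs)
    loss = criterion(logits, labels)

scaler.scale(loss).backward()
scaler.step(opt)
scaler.update()
opt.zero_grad()
print(loss.item())
```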
Training generally works fine. But at seemingly random points in the middle of training, it sometimes aborts with the following error:
RuntimeError: Subtraction, the `-` operator, with two bool tensors is not supported. Use the `^` or `logical_xor()` operator instead.
The error is raised at `return torch.binary_cross_entropy_with_logits(...)`.
Am I doing something wrong? Since the error only happens occasionally, the network itself seems to be generally fine. Could it still be a mistake on my side? Any tips?
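I can reproduce the bare restriction the message refers to, though not the random failure itself: subtracting two bool tensors is rejected, the suggested `^` / `logical_xor()` alternatives work, and BCEWithLogitsLoss runs fine as long as logits and targets are floating point (the values here are made up):

```python
import torch
import torch.nn.functional as F

a = torch.tensor([True, False])
b = torch.tensor([True, True])

# Subtracting two bool tensors raises the same RuntimeError as in the traceback
try:
    a - b
except RuntimeError as e:
    print(e)

# The alternatives suggested by the error message work:
print(a ^ b)                     # element-wise xor
print(torch.logical_xor(a, b))   # same result

# BCEWithLogitsLoss itself is fine with floating-point logits and targets:
logits = torch.randn(4, 1)
targets = torch.randint(0, 2, (4, 1)).float()  # 0.0 / 1.0 labels as float
loss = F.binary_cross_entropy_with_logits(logits, targets)
print(loss.item())
```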
Network structure looks something like this:
# data: input image with 128x80 resolution
data = self.convs(data) # some convolutional layers
data = F.interpolate(data, scale_factor = 1.0 / 2.0, mode="bilinear", align_corners=False) # 64x40
data = self.convs(data)
data = F.interpolate(data, scale_factor = 1.0 / 2.0, mode="bilinear", align_corners=False) # 32x20
data = self.convs(data)
data = F.interpolate(data, scale_factor = 1.0 / 2.0, mode="bilinear", align_corners=False) # 16x10
data = self.convs(data)
data = F.interpolate(data, scale_factor = 1.0 / 2.0, mode="bilinear", align_corners=False) # 8x5
data = self.convs(data)
data = F.interpolate(data, scale_factor = 1.0 / 2.0, mode="bilinear", align_corners=False) # 4x2
data = data.view(data.shape[0], -1)
data = self.linearLayers(data) # uses Linear layers to get down to 1 element
return data
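In case it helps to reproduce, here is a runnable stand-in for that sketch; the channel counts, conv stack, and linear sizes are made up, and the conv stack is reused at every scale just like in the pseudocode above:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BinaryClassifier(nn.Module):
    # Placeholder architecture matching the sketch; channel counts are made up.
    def __init__(self, channels: int = 16):
        super().__init__()
        self.stem = nn.Conv2d(3, channels, kernel_size=3, padding=1)
        # One small conv stack, applied at every scale (as in the sketch)
        self.convs = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(),
        )
        # Spatial size is 4x2 after five halvings of 128x80
        self.linearLayers = nn.Sequential(
            nn.Linear(channels * 4 * 2, 32),
            nn.ReLU(),
            nn.Linear(32, 1),  # single logit; BCEWithLogitsLoss applies the sigmoid
        )

    def forward(self, data):
        data = self.stem(data)  # 128x80
        for _ in range(5):      # 64x40, 32x20, 16x10, 8x5, 4x2
            data = self.convs(data)
            data = F.interpolate(data, scale_factor=0.5,
                                 mode="bilinear", align_corners=False)
        data = data.view(data.shape[0], -1)
        return self.linearLayers(data)

model = BinaryClassifier()
out = model(torch.randn(2, 3, 80, 128))  # NCHW: batch of 2 RGB images
print(out.shape)  # one logit per image
```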