What Loss function to use in Binary CNN Classification problem

I am running a transfer-learning scenario with a ResNet model. The original work was a classifier with hundreds of classes, and it used the cross-entropy loss function nn.CrossEntropyLoss().
A thread here suggests BCELoss, but there is also BCEWithLogitsLoss, which seems to fit.

In a confusion matrix, I want to optimize for the fewest False Positives, even if that hurts my True Positive score.
Which of these loss functions should I use, and why?

Hi Rafael!

For a binary classification problem, BCEWithLogitsLoss
should be your go-to loss function. (You would only want to
use BCELoss if your network naturally emits probabilities,
which it almost certainly doesn’t.)
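As a small illustration (the tensors here are made up), the only practical difference between the two losses is where the sigmoid is applied:

```python
import torch
import torch.nn as nn

logits = torch.tensor([0.8, -1.2, 0.3])   # raw (unnormalized) model outputs
targets = torch.tensor([1.0, 0.0, 1.0])   # BCE-style losses want float targets

# BCEWithLogitsLoss applies the sigmoid internally, in a numerically stable way
loss_logits = nn.BCEWithLogitsLoss()(logits, targets)

# BCELoss expects probabilities, so you must apply the sigmoid yourself
loss_probs = nn.BCELoss()(torch.sigmoid(logits), targets)

# Both give the same value (up to floating-point error); the first is preferred
print(loss_logits.item(), loss_probs.item())
```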

Consider carefully the trade-off you are implying here. As an
extreme, you could simply classify everything as “negative”.
You would then achieve the fewest False Positives (none at all),
but, of course, your True Positive score would also be as bad as
possible, namely zero.

So I assume your real trade-off is that you are willing to reduce
your True Positives some, but not too much, if you can get a
substantial reduction in your False Positives.

A sensible approach to achieve this would be to weight your
“negative” samples more heavily than your “positive” samples.
BCEWithLogitsLoss has a pos_weight argument for this purpose:
a value less than one down-weights the positive class, which is
equivalent to weighting the negative class more heavily.

The relative weight between your “negative” and “positive” samples
will determine how much you train your network to reduce your
False Positives at the cost of reducing your True Positives.
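A minimal sketch of this weighting (the pos_weight value of 0.5 is purely illustrative; you would tune it against your validation confusion matrix):

```python
import torch
import torch.nn as nn

# pos_weight < 1 down-weights the loss on positive samples, which is the same
# as weighting negative samples more heavily: the network becomes more
# reluctant to predict "positive", reducing False Positives.
criterion = nn.BCEWithLogitsLoss(pos_weight=torch.tensor([0.5]))

logits = torch.tensor([1.5, -0.5])    # made-up model outputs
targets = torch.tensor([1.0, 0.0])    # one positive, one negative sample

loss = criterion(logits, targets)
```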

Good luck.

K. Frank


Thanks! Very informative.

I’m just trying to figure out what changes when switching from CrossEntropyLoss().cuda() to BCEWithLogitsLoss().cuda()

Simply substituting BCEWithLogitsLoss for CrossEntropyLoss throws this error:
ValueError: Target size (torch.Size([64])) must be the same as input size (torch.Size([64, 2]))

This is a snippet of the training step:

    with torch.no_grad():  # replaces the deprecated volatile=True Variables
        for i, (input, target) in enumerate(val_loader):
            target = target.cuda(non_blocking=True)  # async= is a reserved word in Python 3.7+

            # compute output
            output = model(input)
            loss = criterion(output, target)  # <- error here!

Edit: More Information

Here are the two variables passed into criterion() :

(Pdb) output
tensor([[-0.2657,  0.1728],
        [ 0.3407, -0.6961],
        [ 0.8020, -0.8201],
        [ 0.1457,  0.0311],
        [-0.2517,  0.0223],
        [-0.1266, -0.3978],
        [ 0.4527, -0.6096],
        [ 0.2077, -0.1428],
        [-0.1205, -0.5252],
        [ 0.5462, -0.3988],
        [-0.1215, -0.1321],
        [ 0.3062, -0.5417],
        [ 0.0723, -0.0537],
        [-0.5435, -1.1898],
        [ 0.0718, -0.0986],
        [ 0.0118, -0.0860],
        [-0.0998, -0.8494],
        [-0.2591, -0.4207],
        [ 0.2687, -0.6160],
        [-0.2336, -0.4814],
        [-0.1896, -0.1463],
        [ 0.4623, -0.5179],
        [-0.3181, -0.3042],
        [-0.2550, -0.1824],
        [-0.6250, -0.1293],
        [-0.8920,  0.1077],
        [ 0.0013, -0.1081],
        [-0.2565, -0.0777],
        [-0.2360, -0.3112],
        [ 0.0615, -0.3419],
        [-0.4794, -0.1323],
        [-0.0624,  0.1003],
        [ 0.1803, -0.2833],
        [-0.0859,  0.0516],
        [-0.0256, -0.4226],
        [-0.6047, -0.3403],
        [ 0.2778, -0.6168],
        [ 0.0973, -0.3736],
        [-0.2165, -0.2941],
        [ 0.0252, -0.2497],
        [-0.1285, -0.3079],
        [-0.3292, -0.5657],
        [ 0.1660, -0.5869],
        [-0.1829, -0.3313],
        [-0.5305,  0.0671],
        [ 0.2120, -0.5442],
        [-0.1197, -0.0711],
        [ 0.2132, -0.5229],
        [-0.0977, -0.3243],
        [ 0.1694, -0.2342],
        [ 0.0137, -0.3607],
        [-0.3495, -0.2702],
        [ 0.3058, -0.8327],
        [ 0.4417, -0.7817],
        [-0.7523, -0.5299],
        [ 0.0826, -0.3280],
        [-0.4834, -0.4926],
        [-0.5763,  0.0012],
        [ 0.0992, -0.8658],
        [-0.1066,  0.4763],
        [-0.4472,  0.2544],
        [-0.3449, -0.1687],
        [-0.1852,  0.1073],
        [-0.0782, -0.5123]], device='cuda:0', grad_fn=<AddmmBackward>)
(Pdb) target
tensor([0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1,
        0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1,
        0, 1, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1], device='cuda:0')

Hello Rafael!

Without seeing the rest of your code – your model, in particular –
I’m guessing somewhat, but I believe that your problem could be
the following:

If you have a multi-class problem (where “multi” implies more
than two, and two classes is what we call a “binary” problem)
with nClass classes, then the output of your model should be
nClass logits. (This is what CrossEntropyLoss expects.)

In contrast, for a binary problem, the output of your model should
be a single logit (not two), conventionally taken to be the logit for
your “positive” class. This is what BCEWithLogitsLoss expects.

If you build your binary-problem model as a two-class multi-class
model, then you will (redundantly) have two logits as your output
(one for your “negative” class and one for your “positive” class).
This won’t match what BCEWithLogitsLoss expects, which would
produce exactly the error you report.
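If that is the case, one possible fix (the dimensions here are placeholders for your actual ResNet head) is to replace the final layer with a single-logit head and give BCEWithLogitsLoss float targets of matching shape:

```python
import torch
import torch.nn as nn

# Stand-in for the ResNet's final fully connected layer, now with one output
fc = nn.Linear(512, 1)

features = torch.randn(64, 512)      # stand-in for pooled ResNet features
output = fc(features).squeeze(1)     # shape [64]: one logit per sample

target = torch.randint(0, 2, (64,)).float()  # BCEWithLogitsLoss wants float targets
loss = nn.BCEWithLogitsLoss()(output, target)
```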

(You can build your binary model as a two-class multi-class
model, but then you should feed the model’s output into
CrossEntropyLoss. If you match everything up right, you
should get the same results as you would with a conventional
binary model feeding BCEWithLogitsLoss, but you probably
lose a more or less insignificant bit of efficiency.)
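This equivalence can be checked directly: softmax over two logits reduces to a sigmoid of their difference, so CrossEntropyLoss on a two-logit output matches BCEWithLogitsLoss on the (positive minus negative) logit:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
two_logits = torch.randn(8, 2)        # two-class ("redundant") model output
target = torch.randint(0, 2, (8,))    # integer class labels for CrossEntropyLoss

ce = nn.CrossEntropyLoss()(two_logits, target)

# Collapse the two logits into one: softmax([z0, z1])[1] == sigmoid(z1 - z0)
one_logit = two_logits[:, 1] - two_logits[:, 0]
bce = nn.BCEWithLogitsLoss()(one_logit, target.float())

print(torch.allclose(ce, bce, atol=1e-6))  # True
```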

Best.

K. Frank
