Binary classification model outputs low probabilities

Hi, I have a binary classification model for detecting anomalies in data.
The problem is that it outputs low probabilities (0.2 at most), when I expect it to output 0.5 or more for the relevant events.
The events I am looking for are sparse, so they appear rarely in the training data. Could this be the “unbalanced data” problem (I read about it somewhere)?
But I also kind of want the network to learn from all the negative examples.
Am I doing something wrong?
I am using BCELoss, Adam optimizer, and my model looks like this:

BinaryClassification(
  (gru): GRU(38, 1024, batch_first=True)
  (fc_layer): Sequential(
    (0): Linear(in_features=1024, out_features=512, bias=True)
    (1): SELU()
    (2): Linear(in_features=512, out_features=1, bias=True)
    (3): Sigmoid()
  )
)

Or maybe there is a different approach to solving such problems?

Hi Nyakov!

Yes, it very well could be. If you train on a lot of negative samples and
very few positive samples, your network can get trained to (almost)
always predict “negative” without really “learning” anything about how
your “negative” and “positive” samples differ, and still achieve a low loss,
because blindly predicting “negative” will almost always be correct.

My preferred approach is to sample from the training set with positive
samples weighted enough more heavily that a typical batch has about
equal numbers of positive and negative samples.
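
A rough sketch of that idea using torch.utils.data.WeightedRandomSampler (the tensors here are placeholders for your own data, and your sliding-window GRU input is simplified to a flat feature tensor):

import torch
from torch.utils.data import TensorDataset, DataLoader, WeightedRandomSampler

# placeholder data -- substitute your own features and 0/1 labels
train_labels = torch.tensor([0, 0, 0, 1, 0, 1, 0, 0], dtype=torch.float)
train_features = torch.randn(len(train_labels), 38)
dataset = TensorDataset(train_features, train_labels)

# weight each sample inversely to its class frequency so that a typical
# batch contains roughly equal numbers of positives and negatives
n_pos = train_labels.sum()
n_neg = len(train_labels) - n_pos
sample_weights = torch.where(train_labels == 1, 1.0 / n_pos, 1.0 / n_neg)

sampler = WeightedRandomSampler(sample_weights, num_samples=len(dataset), replacement=True)
loader = DataLoader(dataset, batch_size=4, sampler=sampler)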

A second approach – which I would use if the number of positive samples
in your training set is so small that the first approach frequently results in
duplicate positive samples in a batch – is to use BCEWithLogitsLoss’s
pos_weight constructor argument to weight the positive samples more
heavily in the loss function.
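
For example (the class counts here are made up -- use the counts from your own training set):

import torch

# pos_weight scales the loss contribution of the positive class;
# a common starting point is (number of negatives / number of positives)
n_neg, n_pos = 6200, 100                  # placeholder counts
pos_weight = torch.tensor([n_neg / n_pos])

criterion = torch.nn.BCEWithLogitsLoss(pos_weight=pos_weight)

logits = torch.randn(8, 1)                # raw model outputs (no Sigmoid)
targets = torch.randint(0, 2, (8, 1)).float()
loss = criterion(logits, targets)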

As an aside, you should be using BCEWithLogitsLoss (instead of
BCELoss) without the final Sigmoid layer as it has better numerical
stability.
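
Concretely, that means dropping the final Sigmoid from the model you posted and feeding the raw output to BCEWithLogitsLoss, roughly like this (only the fully-connected head is shown; the GRU is omitted for brevity):

import torch

head = torch.nn.Sequential(
    torch.nn.Linear(1024, 512),
    torch.nn.SELU(),
    torch.nn.Linear(512, 1),              # no Sigmoid here
)
criterion = torch.nn.BCEWithLogitsLoss()

# at inference time, apply sigmoid() only if you actually need probabilities
probs = torch.sigmoid(head(torch.randn(4, 1024)))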

Best.

K. Frank

Thank you, pos_weight worked for me. The training error went up, but the model seems to be working.

Sampling is not an option for me, because positive samples are sparse (62 negatives to one positive), and I also take my data from one big tensor with a sliding window, so sampling would either push RAM consumption too high for my system or make things too complex for the moment.

I used (negative_samples / positive_samples) as the value for pos_weight, and it works for now. Maybe it is a bit extreme?

Hi Nyakov!

This is the most sensible value of pos_weight to start with unless you have
some specific reason to use something different.

The next thing you can do – if you care – is to look at your false-negative and
false-positive rates after you have trained. You might want them to be about
the same (if they aren’t) or you might care about improving one at the cost of
degrading the other.

If you want to decrease your false-negative rate – that is, correctly predict
“positive” for more of your actual positive inputs – you would train with a
larger value of pos_weight (but doing so would typically come at the cost
of an increased false-positive rate).
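
As an illustration, you could compute the two rates from your validation predictions with a small helper like this (fn_fp_rates is just a hypothetical name; it thresholds the sigmoid of the logits at 0.5):

import torch

def fn_fp_rates(logits, targets, threshold=0.5):
    # returns (false_negative_rate, false_positive_rate) for binary predictions
    preds = (torch.sigmoid(logits) >= threshold).float()
    pos = targets == 1
    neg = targets == 0
    fn_rate = (preds[pos] == 0).float().mean()   # misses among actual positives
    fp_rate = (preds[neg] == 1).float().mean()   # false alarms among actual negatives
    return fn_rate.item(), fp_rate.item()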

Best.

K. Frank

Sorry for the noob question.
But when I have multiple classes and use BCEWithLogitsLoss as the loss function, should I apply Softmax or Sigmoid to the output when I evaluate the model?

Hi Nyakov!

Do you have what I would call a single-label, multi-class problem, where
each sample is labelled with exactly one of multiple labels? Or do you have
a multi-label, multi-class problem, where each sample is labelled with none,
some, or all of the labels, all at once?

In both cases, your final layer should be a Linear with out_features equal
to the number of classes (and no subsequent layer – neither Softmax nor
Sigmoid). But for a single-label problem you would use CrossEntropyLoss
as your loss criterion, while for a multi-label problem, you would continue to
use BCEWithLogitsLoss.

(Internally, CrossEntropyLoss applies log_softmax() and
BCEWithLogitsLoss applies logsigmoid(), so you are, in effect,
using softmax() and sigmoid() for the single-label and multi-label
case, respectively, but don’t use them explicitly, as they’re built into
the loss criteria.)
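
Just to make the two setups concrete, here is a minimal sketch (the Linear layer stands in for the final layer of your own model, and the shapes are made up):

import torch

num_classes = 3
final_layer = torch.nn.Linear(1024, num_classes)   # no Softmax / Sigmoid after it
logits = final_layer(torch.randn(8, 1024))

# single-label, multi-class: target is one integer class index per sample
targets_single = torch.randint(0, num_classes, (8,))
loss_single = torch.nn.CrossEntropyLoss()(logits, targets_single)

# multi-label, multi-class: target is an independent 0/1 value per class per sample
targets_multi = torch.randint(0, 2, (8, num_classes)).float()
loss_multi = torch.nn.BCEWithLogitsLoss()(logits, targets_multi)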

Best.

K. Frank

I have two labels for each sample, each of which can be either 1 or 0. Technically they cannot both be 1 at the same time, but they can both be 0 at the same time.
It’s dog/cat, or neither of the two.
I used two separate networks before (one for dogs and one for cats), but now I’ve decided to merge them into one.

Ok, got it. So if I use BCEWithLogitsLoss, then after training, when I use the model for classification and want to get probabilities from it, I should apply sigmoid() to the model outputs?

Hi Nyakov!

If I understand you correctly, a given sample could be a dog or a cat
or neither, but not both.

That is, you might have a picture of a dog or a cat or an empty scene
or a bunny, but you would not have a picture containing both a dog and
a cat.

I would treat this as a single-label, three-class problem. Your model
should have three outputs, that is, its final layer should be a Linear
with out_features = 3, and the three outputs represent the predicted
(unnormalized) log-probabilities for each of the three classes.

You should then use CrossEntropyLoss as your loss criterion.

If you need probabilities (but you don’t always need them), you would
apply softmax() to the output of your model to convert the unnormalized
log-probabilities to a proper probability distribution over the three classes
where the three probabilities sum to one.
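
Putting that together, a minimal sketch might look like this (the class-index convention 0 = neither, 1 = dog, 2 = cat is just an assumption for illustration, and the Linear layer stands in for your model’s final layer):

import torch

final_layer = torch.nn.Linear(1024, 3)    # three outputs: unnormalized log-probabilities
criterion = torch.nn.CrossEntropyLoss()

# training: targets are class indices (0 = neither, 1 = dog, 2 = cat)
features = torch.randn(8, 1024)
targets = torch.randint(0, 3, (8,))
loss = criterion(final_layer(features), targets)

# inference: apply softmax() only if you actually need probabilities
with torch.no_grad():
    probs = torch.softmax(final_layer(features), dim=1)   # each row sums to one
    predicted_class = probs.argmax(dim=1)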

Best.

K. Frank