How to exploit BCELoss() criterion with no classes

Hi everyone!

I’m trying to deploy a linear feedforward NN that, given an input representing the parameters of a system, outputs its error probability. The final aim is to find the right input-parameter tuning in an iterative way: I start with an input, and if the corresponding error probability is above a threshold, I reduce one of the input parameters until I find a satisfactory error probability [below the threshold].

So, my dataset consists of a set of measurements that couple system parameters (which are simply real numbers) with error probabilities. The NN should learn the relationship between the system setting and its error probability. So my first idea is to use the measured error probabilities as labels (obviously they are real numbers in the interval [0, 1]).

The structure of the NN is given: the input layer obviously has the same dimensionality as the system parameters, then there is a set of hidden layers with the ReLU() activation function, and finally there is a single-neuron output layer activated by a sigmoid function.
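
For concreteness, a minimal sketch of that structure might look like the following (the input size, hidden width, and number of hidden layers below are placeholders, not the actual dimensions of my system):

```python
import torch.nn as nn

# Hypothetical dimensions -- the real ones depend on the system.
n_params = 4      # dimensionality of the system-parameter vector (assumed)
hidden = 64       # width of each hidden layer (assumed)

model = nn.Sequential(
    nn.Linear(n_params, hidden),
    nn.ReLU(),
    nn.Linear(hidden, hidden),
    nn.ReLU(),
    nn.Linear(hidden, 1),   # single-neuron output layer
    nn.Sigmoid(),           # squashes the output into (0, 1)
)
```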

The criterion that has to be used is the BCELoss().

[I know that it is suggested to use nn.BCEWithLogitsLoss without a sigmoid activation function, but the structure is given, so I would prefer not to change it.]
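
For completeness, that alternative would simply drop the final nn.Sigmoid() from the model and use the numerically more stable combined loss, roughly:

```python
import torch.nn as nn

# Alternative structure (not used here): the model ends with a plain Linear
# layer, and the loss applies the sigmoid internally.
criterion = nn.BCEWithLogitsLoss()   # expects raw logits, not sigmoid outputs
```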

My problem is that I am not able to understand why and how the BCELoss() criterion is well suited to this kind of problem.

I knew that it was suggested when there is a classification problem and the number of classes is 2: class 0 and class 1. In such a scenario, the NN learns how to classify its input, outputting the probability that the input belongs to class 1.

But in my case, how should I train the network?

I think I have two possible approaches:

  • I forget about the concept of classes and train the NN by supervising the error probability directly.
  • Since a threshold is defined, which differentiates between acceptable and unacceptable error probabilities, I can use that same threshold to turn my labels into 0s and 1s.

The problem with the first approach is that I think the loss cannot converge to small values. Indeed, just try plugging y = 0.5 (where y is the label, the ground truth) into the BCE formula, for example. I have tried to train the NN and the loss plateaus around 0.4, even though the shape of its curve is really good.

The problem with the second approach is that I don’t want my NN to tell me whether the error probability corresponding to its input is above or below a threshold; I want to know what the error probability actually is.

These last two points are my considerations, but they could definitely be wrong. That’s why I’m asking for a more reliable opinion.

Hi MngFrc!

Various comments, in line, below:

Let me say what I think you’re trying to do, but please correct me
where I’m wrong.

But first, let me comment on some terminology to avoid confusion.

We often call the weights and biases in the layers of a neural network
(as well as other adjustable values) parameters. And we often speak
of training those parameters.

But you have a “system” that has “parameters” (that I assume are
not part of any neural network). Let me call these your system
parameters, or, sometimes, just parameters. And to avoid confusion,
let me call the parameters of the neural network the model weights,
or, sometimes, just the weights.

When we train model weights we use an iterative process – forward
pass, backpropagation, optimizer step. I assume that the iterative
process you wish to use to tune your system parameters is separate
from the pytorch optimization system used to train the weights. Is
this correct, or are you intending to use pytorch optimization to tune
your system parameters?

I am guessing that you assume that you have trained a neural
network that, given a set of system parameters, gives you the
system’s “error probability.” Then you use the neural network
as a black box, having it give you the error probabilities as
you reduce your parameters using your tuning scheme.

Is this correct?

I assume that your dataset comes from actual measurements on an
actual system – or perhaps from a (non-neural-network) model of
your system – where you measure the error probabilities for sets of
system parameters. In particular, you have this dataset independently
of – and before you start to build – your neural network.

Is this correct?

So, a set of floating-point numbers in (the system parameters), and a
floating-point number out (the error probability).

And I assume that the error probability is sensibly continuous. That is,
that, say, 28.6% is “close to” 28.5% and 28.7% in a meaningful way,
but is further away from 35.4%.

It sounds like you should try using MSELoss (mean-squared-error).

Using the error probabilities as labels (in the typical neural-network
sense) likely doesn’t make sense. Usually, one label (say, “cat”) is not
closer to nor further away from any other label (say “fish” or “bird”).

Trying to turn meaningfully continuous floating-point numbers into labels
is artificial, and is likely to throw away some of the useful information in
your dataset.

This makes sense.

This sounds reasonable.

This makes sense.

This sounds like it might be suboptimal to me. It’s true that you know
your output should be between zero and one, so passing it through a
sigmoid() will force it to be between zero and one. But given what
I think your error probability means, my intuition tells me that it might
be better (in this case) for the network to learn that values outside of
(0, 1) aren’t good predictions, penalizing such values – but neither
preventing nor correcting them – when they are output by the network.

Why? “Has to be” is a problematic phrase when you are asking for
suggestions about how to solve a problem.

Again, why? If it were me, I would prefer to change the structure, if
doing so would better solve my problem.

Also, to be clear, based on my current understanding of what you
are trying to do as outlined in my comments above, I am currently
suggesting that you use MSELoss without (or perhaps with) a final
sigmoid() activation.
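
To make that concrete, here is a rough sketch of what I have in mind – the
layer sizes and the data below are just placeholders, not your actual system:

```python
import torch
import torch.nn as nn

# Same feedforward structure, but with the final sigmoid removed
# (you could also keep it; then predictions are confined to (0, 1)).
model = nn.Sequential(
    nn.Linear(4, 64),      # 4 system parameters in (placeholder)
    nn.ReLU(),
    nn.Linear(64, 1),      # one predicted error probability out
)

criterion = nn.MSELoss()   # regression loss on the error probability
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# Placeholder data: measured (system parameters, error probability) pairs.
params = torch.rand(100, 4)
err_prob = torch.rand(100, 1)

for epoch in range(200):
    optimizer.zero_grad()
    pred = model(params)               # predicted error probabilities
    loss = criterion(pred, err_prob)   # mean-squared error vs. measurements
    loss.backward()
    optimizer.step()
```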

Well, if your problem is indeed to predict a continuous output variable,
then I think you’re right that BCELoss is not well-suited to your problem.

To repeat what I said above, based on the assumptions I outlined,
I no longer think that this is a classification problem.

I would train as outlined above, probably using MSELoss as the loss
function.

Yes, if by “supervising the error probability” you mean using a loss
function to quantify the discrepancy between the measured error
probability from your dataset and the network’s predicted error
probability, and then using backpropagation of the loss function to
train your network weights.

This sounds like you would be throwing away useful information in
your data set, and therefore would be making it harder to train your
network.

As I said above, I don’t think that BCELoss is the right loss.

But, as an aside, the optimizer doesn’t try to lower the loss to zero.
It tries to lower the loss to its minimum value, and it doesn’t care
whether that minimum value is 1,000,000 or 0, or -1,000,000. In
the case of BCELoss, if your target value is 0.5, the minimum loss
will not be zero, but it will occur for a predicted value of 0.5.
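
You can check this directly: with a target of 0.5, the best possible
prediction is also 0.5, and even there the per-sample BCE is log (2) ≈ 0.693,
not zero:

```python
import torch
import torch.nn as nn

bce = nn.BCELoss()
target = torch.tensor([0.5])

print(bce(torch.tensor([0.5]), target))   # tensor(0.6931) -- the minimum, not zero
print(bce(torch.tensor([0.9]), target))   # tensor(1.2040) -- larger, so 0.5 is preferred
```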

I agree with this. Assuming I understand your problem correctly, you
want to train your network to predict the actual continuous floating-point
error probability (not just whether it is above or below some threshold),
based on the known, measured values in your dataset.

Good luck.

K. Frank