How to design a NN that targets probabilities?

I want to train a neural network on some image dataset for a regression task. The output would be a probability, so a value between 0 and 1; I have targets that are also probabilities. I want to know what the best training strategy would be.
Is it meaningful to make the last layer a sigmoid, so that the previous layers can output values from -inf to inf? Or should I simply output a value between 0 and 1 directly, so that the loss can be calculated with MSE or MAE? It would make no sense for the network to output a value greater than 1 or less than 0, though.
Also, should I scale the sigmoid? Points close to 0 will be very sensitive to noise. If yes, should the scale be a hyperparameter?

I thank you in advance for your help.

Hi Chenoille!

We would typically not use the term “regression task” when training a
network to predict probabilities.

If you believe that your use case is a classic regression task, please
explain it in a little more detail.

Based on what I think your use case is, you should use BCEWithLogitsLoss
as your loss criterion.

Note that BCEWithLogitsLoss takes logits as its input, rather than actual
probabilities. It does take probabilities as its target.
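As a minimal sketch of that input/target convention (the batch values here are made up for illustration), the loss is called on raw logits with probability targets, and sigmoid() is only used when you want to read off a prediction:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical batch: 4 raw network outputs (logits) and 4 probability targets.
logits = torch.randn(4, 1)                             # unbounded real values
targets = torch.tensor([[0.0], [0.25], [0.7], [1.0]])  # probabilities in [0, 1]

criterion = nn.BCEWithLogitsLoss()
loss = criterion(logits, targets)  # the sigmoid is applied inside the loss

# Convert logits to probabilities only when you need to read the prediction.
probs = torch.sigmoid(logits)
print(loss.item(), probs.squeeze(1).tolist())
```

Note that the targets do not have to be exactly 0 or 1; any value in between is accepted.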

Your network should be whatever architecture is appropriate for your overall
use case, and will depend on the character of the problem you are trying to
solve and the structure of your input data.

However, you would typically want your final layer to be a fully-connected
Linear with a single output (out_features = 1). This will be your logit,
and will range from -inf to inf. Do not pass this output through a
sigmoid() – doing so would convert your logit to a probability, which
is not what BCEWithLogitsLoss expects.
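Here is a rough sketch of what that could look like for image input. The ProbNet class, its layer sizes, the learning rate, and the random data are all hypothetical; the only part that reflects the advice above is the final Linear with out_features = 1 and the absence of a sigmoid() in forward():

```python
import torch
import torch.nn as nn

class ProbNet(nn.Module):
    """Hypothetical small CNN; only the final Linear(8, 1) reflects the advice."""

    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 8, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),  # -> (batch, 8, 1, 1)
            nn.Flatten(),             # -> (batch, 8)
        )
        self.head = nn.Linear(8, 1)   # single logit, no sigmoid here

    def forward(self, x):
        return self.head(self.features(x))

torch.manual_seed(0)
model = ProbNet()
criterion = nn.BCEWithLogitsLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

images = torch.randn(4, 3, 32, 32)  # made-up image batch
targets = torch.rand(4, 1)          # made-up probability targets

# One training step on the raw logits.
optimizer.zero_grad()
logits = model(images)
loss = criterion(logits, targets)
loss.backward()
optimizer.step()

# At inference time, apply sigmoid to read off a probability.
model.eval()
with torch.no_grad():
    probs = torch.sigmoid(model(images))
```

The sigmoid() shows up only outside the training path, so the predicted probability can never leave the range 0 to 1 even though the network itself is unconstrained.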

In some sense, this is what using logits and BCEWithLogitsLoss is doing
(in contrast to using probabilities and pytorch’s plain-vanilla BCELoss).
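To make the relationship between the two losses concrete, here is a small check (with made-up random values) that BCEWithLogitsLoss on logits agrees with BCELoss on sigmoid(logits) when the logits are moderate in size:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
logits = torch.randn(5, 1)   # moderate-sized logits
targets = torch.rand(5, 1)   # probability targets

# Sigmoid folded into the loss.
loss_with_logits = nn.BCEWithLogitsLoss()(logits, targets)

# Sigmoid applied by hand, then the plain loss.
loss_plain = nn.BCELoss()(torch.sigmoid(logits), targets)

same = torch.allclose(loss_with_logits, loss_plain, atol=1e-5)
print(loss_with_logits.item(), loss_plain.item(), same)
```

So the sigmoid has not gone away; it has just moved from the end of your network into the loss function.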

I’m not sure that I would call it noise exactly, but using BCEWithLogitsLoss
with logits is significantly more numerically stable than using BCELoss with
probabilities.
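One way to see the stability difference is to feed both losses an extreme logit. This is only an illustrative sketch: the value 200 is chosen arbitrarily, and float32 (the pytorch default) is assumed. In float32, sigmoid() of such a logit rounds to exactly 1, so the plain BCELoss no longer sees how wrong the prediction is (its documentation notes that it clamps its log terms at -100), while BCEWithLogitsLoss still works from the logit itself:

```python
import torch
import torch.nn as nn

logit = torch.tensor([[200.0]])  # a very confident prediction of "1"
target = torch.tensor([[0.0]])   # ... for a target of 0

# Computed directly from the logit.
stable = nn.BCEWithLogitsLoss()(logit, target)

# Computed from a probability that has already saturated in float32.
prob = torch.sigmoid(logit)
saturated = nn.BCELoss()(prob, target)

print(prob.item(), stable.item(), saturated.item())
```

The loss computed from the logit keeps tracking the size of the error, whereas the one computed from the saturated probability is capped, which also means it gives no useful gradient in that regime.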


K. Frank

Thank you for this answer. That was exactly what I was looking for.