Hi, we use convolutional networks (like ShuffleNet) and are training single-label binary, multi-label, and (single-label) multi-class PyTorch models. We also need probability values for each prediction.
We would like to check best practice for these cases, and we haven’t found a clear answer online so far.

Describing each scenario:

Single label / binary

training label: single label (0/1)

training criterion: CrossEntropyLoss

prediction class: max(outputs)

prediction probability: softmax(outputs)

Single label / multi-class

training label: single label (0-X)

training criterion: CrossEntropyLoss

prediction class: max(outputs)

prediction probability: softmax(outputs)

Multi-label / binary

training label: an array of binary labels [0/1, 0/1, …]

training criterion: BCEWithLogitsLoss

prediction class: for each output > 0

prediction probability: sigmoid(outputs)

Is this the right way of doing this?

For single binary label, is it better to use arrays with single elements for training data and subsequently treat similar to multi-label? Should the same be done to train multi-class using input values no greater than 1?

Also for single binary label (2 classes), what’s the right way of obtaining a meaningful probability?

You may treat a binary problem as a two-class multi-class problem,
and then you would use CrossEntropyLoss just as you would for
a more-than-two-class multi-class problem.

But it’s a little clearer and modestly more efficient to treat it explicitly
as a binary problem, in which case you would …

Your training label would be a single floating-point value (per sample)
between 0.0 and 1.0. This is the probability that the sample is in
the “positive” class (the “yes” class or “class-1” or whatever you want
to call it).

If you want a purely binary label, just restrict the label to be either
exactly 0.0 or exactly 1.0.

Use BCEWithLogitsLoss (and have your predictions be the output
of your final Linear (or convolutional) layer without any subsequent
non-linear activations). These predictions will now be raw-score logits
that run from -inf to inf.

To get your prediction class as an integer class label equal to 0 or 1,
threshold against 0.0:

prediction_class = (outputs > 0.0).long()

For the binary case the predicted probability of being in “class-1” is:

prediction_probability = outputs.sigmoid()
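Putting the binary pieces together, here is a minimal sketch. The `nn.Linear(8, 1)` model, the batch size, and the label values are made up for illustration; any network whose final layer emits one raw logit per sample would fit.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-in for a network that emits one logit per sample
# (no sigmoid or other activation after the final layer).
model = nn.Linear(8, 1)
criterion = nn.BCEWithLogitsLoss()

inputs = torch.randn(4, 8)                     # [nBatch, nFeatures]
labels = torch.tensor([0.0, 1.0, 1.0, 0.3])    # float targets in [0.0, 1.0]

outputs = model(inputs).squeeze(1)             # raw logits, shape [nBatch]
loss = criterion(outputs, labels)              # outputs and labels have the same shape

prediction_class = (outputs > 0.0).long()      # integer labels, 0 or 1
prediction_probability = outputs.sigmoid()     # probability of "class-1"
```

Thresholding the logit at 0.0 corresponds to thresholding the probability at 0.5, because sigmoid(0.0) is 0.5.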

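Yes, for the single-label multi-class case, integer class labels with
CrossEntropyLoss are right.

For the prediction class, max(outputs) should be argmax().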

Let’s assume that outputs has shape = [nBatch, nClass]. Then:
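A minimal sketch of that step, assuming random stand-in logits in place of real model outputs and assuming the class dimension is dim 1:

```python
import torch

torch.manual_seed(0)

# Stand-in logits with shape [nBatch, nClass]; real values would come from the model.
outputs = torch.randn(4, 3)

# Index of the largest logit per sample, shape [nBatch].
prediction_class = outputs.argmax(dim=1)

# Per-class probabilities, shape [nBatch, nClass]; each row sums to 1.
prediction_probability = outputs.softmax(dim=1)
```

Since softmax preserves the ordering of the logits, taking argmax() of the probabilities gives the same class as taking it of the raw outputs.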

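Yes, for the multi-label case the training label is an array of binary
labels. (Again, these could be probabilistic labels between 0.0 and 1.0.)

Yes, BCEWithLogitsLoss, thresholding each output against 0, and
sigmoid(outputs) are right for the multi-label case.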

Yes. Treat a single-label binary problem as a multi-label, multi-class
problem where you have just one class (and hence just one label).

If I understand your question correctly, the predictions (“input values”?)
for a (single-label) multi-class problem should be logits that run from
-inf to inf, rather than probabilities that would be restricted to be
no greater than 1.0. This is because PyTorch’s CrossEntropyLoss
expects its input (the predictions) to be logits. (It might better be
called CrossEntropyWithLogitsLoss.)
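To illustrate why the predictions should be logits, here is a small sketch with made-up values. It checks that the cross-entropy loss matches log_softmax() followed by the negative log-likelihood loss, so probabilities passed in as predictions would effectively go through softmax a second time.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

logits = torch.randn(4, 3)             # raw scores, shape [nBatch, nClass]
target = torch.tensor([0, 2, 1, 2])    # integer class labels, shape [nBatch]

# cross_entropy applies log_softmax internally ...
loss_from_logits = F.cross_entropy(logits, target)

# ... so it agrees with doing the two steps by hand.
loss_manual = F.nll_loss(F.log_softmax(logits, dim=1), target)

# Feeding probabilities instead runs without error but gives a different loss.
loss_from_probs = F.cross_entropy(logits.softmax(dim=1), target)
```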

For a single-label binary problem, your prediction (your outputs) should
be a single (per sample) logit (and you should use BCEWithLogitsLoss
rather than BCELoss) and you obtain the predicted probability of your
sample being in “class-1” by taking outputs.sigmoid().

Hi @KFrank, thank you very much for your detailed response!

So for the single-label binary (non-multi-class) case, I now understand it’s better to use a single model output in combination with BCEWithLogitsLoss. This also removes the ambiguity of the classes and probabilities.

Regarding the input values used for multi-class, what I meant was the training labels passed to the loss, not the model predictions. Re-reading the documentation, CrossEntropyLoss accepts either an array of class probabilities or a single multi-class label per sample, so the latter seems more convenient and readily usable from our source labels.
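For reference, a small sketch of the two label formats, with made-up logits and labels. It assumes a PyTorch version recent enough to accept class-probability targets (1.10 or later, as far as I know). With one-hot probabilities both formats should give the same loss.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

criterion = nn.CrossEntropyLoss()
outputs = torch.randn(4, 3)                    # stand-in logits, [nBatch, nClass]

# Format 1: one integer class index per sample, shape [nBatch].
index_labels = torch.tensor([2, 0, 1, 1])

# Format 2: one row of class probabilities per sample, shape [nBatch, nClass].
prob_labels = F.one_hot(index_labels, num_classes=3).float()

loss_index = criterion(outputs, index_labels)
loss_prob = criterion(outputs, prob_labels)
```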

Here is the updated overview (re-ordered for logic). (I wrote this in pseudo language, but indeed for multi-class we use argmax getting the index of the maximum value.)

training label: an array of (binary) labels [0.0/1.0, 0.0/1.0, …]

training criterion: BCEWithLogitsLoss

prediction class: for each {output<=0: 0, output>0: 1}

prediction probability: sigmoid(outputs)
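As a sketch of this overview for the multi-label case (the number of labels and the random tensors are placeholders, not our real data):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

n_labels = 5                                          # hypothetical number of labels
outputs = torch.randn(4, n_labels)                    # stand-in logits, [nBatch, nLabels]
labels = torch.randint(0, 2, (4, n_labels)).float()   # 0.0 / 1.0 per label

criterion = nn.BCEWithLogitsLoss()
loss = criterion(outputs, labels)

# Each label is decided independently of the others.
prediction_class = (outputs > 0.0).long()             # [nBatch, nLabels] of 0 / 1
prediction_probability = outputs.sigmoid()            # [nBatch, nLabels]
```

Unlike the softmax case, the probabilities in one row are independent per label and need not sum to 1.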

The other thing we came across that is worth noting is that the loss function usually needs tensors of the same size. So if the output layer has a size of, for example, 1000 (like ShuffleNet), then for training X labels the model output needs to be cropped to match the number of labels before passing it into the loss function.
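A sketch of the cropping we mean, using a plain `nn.Linear` as a hypothetical stand-in for a 1000-output network. The last two lines show an alternative that is not something we have tested: sizing the final layer to the number of labels instead of cropping.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

n_labels = 5                                          # hypothetical number of labels
backbone = nn.Linear(16, 1000)                        # stand-in for a 1000-output model
criterion = nn.BCEWithLogitsLoss()

inputs = torch.randn(4, 16)
labels = torch.randint(0, 2, (4, n_labels)).float()   # [nBatch, nLabels]

outputs = backbone(inputs)                            # [nBatch, 1000]
cropped = outputs[:, :n_labels]                       # keep the first n_labels outputs
loss = criterion(cropped, labels)                     # shapes now match

# Alternative (an assumption): give the final layer exactly n_labels outputs.
head = nn.Linear(16, n_labels)
loss_resized = criterion(head(inputs), labels)
```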