Single vs multi-label vs multi-class training best practice

Hi, we use convolutional networks (like shufflenet) and are training single, multi-label and (single label) multi-class PyTorch models. We also need probability values for each prediction.
We would like to check best practice for these cases, and we haven’t found a clear answer online so far.

Describing each scenario:

Single label / binary

• training label: single label (0/1)
• training criterion: CrossEntropyLoss
• prediction class: max(outputs)
• prediction probability: softmax(outputs)

Single label / multi-class

• training label: single label (0-X)
• training criterion: CrossEntropyLoss
• prediction class: max(outputs)
• prediction probability: softmax(outputs)

Multi-label / binary

• training label: an array of binary labels [0/1, 0/1, …]
• training criterion: BCEWithLogitsLoss
• prediction class: for each output > 0
• prediction probability: sigmoid(outputs)

Is this the right way of doing this?

For single binary label, is it better to use arrays with single elements for training data and subsequently treat similar to multi-label? Should the same be done to train multi-class using input values no greater than 1?

Also for single binary label (2 classes), what’s the right way of obtaining a meaningful probability?

All feedback appreciated!

Hi Joost!

You may treat a binary problem as a two-class multi-class problem,
and then you would use `CrossEntropyLoss` just as you would for
a more-than-two-class multi-class problem.

But it’s a little clearer and modestly more efficient to treat it explicitly
as a binary problem, in which case you would …

Your training label would be a single floating-point value (per sample)
between `0.0` and `1.0`. This is the probability that the sample is in
the “positive” class (the “yes” class or “class-1” or whatever you want
to call it).

If you want a purely binary label, just restrict the label to be either
exactly `0.0` or exactly `1.0`.

Use `BCEWithLogitsLoss` (and have your predictions be the output
of your final `Linear` (or convolutional) layer without any subsequent
non-linear activations). These predictions will now be raw-score logits
that run from `-inf` to `inf`.

To get your prediction class as an integer class label equal to `0` or `1`,
threshold against `0.0`:

```
prediction_class = (outputs > 0.0).long()
```

For the binary case the predicted probability of being in “class-1” is:

```
prediction_probability = outputs.sigmoid()
```
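Putting the binary pieces together, here is a minimal sketch. The logit and label values are made up for illustration, and `outputs` is assumed to have shape `[nBatch]` (if your model returns `[nBatch, 1]`, the labels need the same shape):

```python
import torch
import torch.nn as nn

# Hypothetical raw logits for a batch of 4 samples, one logit per sample.
outputs = torch.tensor([-2.0, -0.5, 0.0, 3.0])
# BCEWithLogitsLoss expects floating-point targets with the same shape.
labels = torch.tensor([0.0, 0.0, 1.0, 1.0])

criterion = nn.BCEWithLogitsLoss()
loss = criterion(outputs, labels)  # scalar: mean loss over the batch

# A logit above 0.0 corresponds to a probability above 0.5.
prediction_class = (outputs > 0.0).long()
# Predicted probability of being in "class-1".
prediction_probability = outputs.sigmoid()
```

Note that a logit of exactly `0.0` (probability `0.5`) falls into class `0` with this threshold.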

Yes.

This should be `argmax()`.

Let’s assume that `outputs` has shape = `[nBatch, nClass]`. Then:

```
prediction_class = outputs.argmax(dim=1)
# or equivalently
prediction_class = outputs.max(dim=1)[1]
```
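As a rough end-to-end sketch of the multi-class case (the random logits and the label values are placeholders, not real model output):

```python
import torch
import torch.nn as nn

nBatch, nClass = 4, 3
torch.manual_seed(0)
outputs = torch.randn(nBatch, nClass)  # hypothetical raw logits, shape [nBatch, nClass]
labels = torch.tensor([0, 2, 1, 2])    # integer class indices, shape [nBatch]

# CrossEntropyLoss takes the raw logits directly (no softmax beforehand).
loss = nn.CrossEntropyLoss()(outputs, labels)

prediction_class = outputs.argmax(dim=1)         # index of the largest logit per sample
prediction_probability = outputs.softmax(dim=1)  # each row sums to 1
```

Because softmax preserves the ordering of the logits, taking `argmax()` of the logits or of the probabilities gives the same class.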

Yes.

Yes. (Again, these could be probabilistic labels between `0.0` and `1.0`.)

Yes.

Yes. Treat a single-label binary problem as a multi-label, multi-class
problem where you have just one class (and hence just one label).

If I understand your question correctly, the predictions (“input values”?)
for a (single-label) multi-class problem should be logits that run from
`-inf` to `inf`, rather than probabilities that would be restricted to be
no greater than `1.0`. This is because pytorch’s `CrossEntropyLoss`
expects its `input` (the predictions) to be logits. (It might better be called
`CrossEntropyWithLogitsLoss`.)
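One way to see that the loss works on logits is to compare it with applying `log_softmax()` yourself and then using the negative log-likelihood loss. The tensor values below are arbitrary illustration values:

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([[2.0, -1.0, 0.5],
                       [0.1,  0.2, 3.0]])  # raw scores, shape [nBatch, nClass]
target = torch.tensor([0, 2])              # integer class labels

# cross_entropy applies log_softmax internally ...
loss_ce = F.cross_entropy(logits, target)
# ... so it should match log_softmax followed by nll_loss.
loss_manual = F.nll_loss(F.log_softmax(logits, dim=1), target)
```

The two values agree up to floating-point round-off, so passing already-softmaxed probabilities as `input` would apply softmax twice.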

For a single-label binary problem, your prediction (your `outputs`) should
be a single (per sample) logit (and you should use `BCEWithLogitsLoss`
rather than `BCELoss`) and you obtain the predicted probability of your
sample being in “class-1” by taking `outputs.sigmoid()`.
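To connect this with the two-class `CrossEntropyLoss` view mentioned at the start, here is a small check (with made-up logits) that the sigmoid of a single logit equals the "class-1" softmax probability of a two-logit prediction whose "class-0" logit is fixed at zero:

```python
import torch

z = torch.tensor([-3.0, 0.0, 1.5])  # hypothetical single-logit outputs

p_sigmoid = z.sigmoid()

# Build equivalent two-class logits: [0, z] for each sample.
two_class_logits = torch.stack([torch.zeros_like(z), z], dim=1)  # shape [nBatch, 2]
p_softmax = two_class_logits.softmax(dim=1)[:, 1]                # probability of class-1
```

More generally, for two-class logits the class-1 softmax probability is the sigmoid of the difference between the two logits, which is why the single-logit formulation loses nothing.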

Best.

K. Frank

Hi @KFrank, thank you very much for your detailed response!

So for the single-label binary (non-multi-class) case, I now understand it’s better to use a single model output in combination with `BCEWithLogitsLoss`. This also removes ambiguity about the classes and probabilities.

Regarding the input values used for multi-class, what I meant was the model input / training labels. Re-reading the documentation, `CrossEntropyLoss` accepts either an array of class probabilities or a single multi-class label (a class index), so the latter seems more convenient and readily usable from our source labels.

Here is the updated overview (re-ordered for logic). (I wrote this in pseudo language, but indeed for multi-class we use argmax getting the index of the maximum value.)

Single label / multi-class

• training label: single label (0/1/2/…)
• training criterion: CrossEntropyLoss
• prediction class: argmax(outputs) ( `outputs.argmax(dim=1)` )
• prediction probability: softmax(outputs) ( `outputs.softmax(dim=1)` )

Single label / binary

• training label: single (binary) labels [0.0/1.0]
• training criterion: BCEWithLogitsLoss
• prediction class: {output<=0: 0, output>0: 1} ( `(outputs > 0.0).long()` )
• prediction probability: sigmoid(outputs) ( `outputs.sigmoid()` )

Multi-label / binary

• training label: an array of (binary) labels [0.0/1.0, 0.0/1.0, …]
• training criterion: BCEWithLogitsLoss
• prediction class: for each {output<=0: 0, output>0: 1}
• prediction probability: sigmoid(outputs)
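For the multi-label case, a minimal sketch in actual PyTorch might look like the following, assuming `outputs` has shape `[nBatch, nLabel]` (the values are invented for illustration):

```python
import torch
import torch.nn as nn

# Hypothetical logits for 2 samples and 3 independent binary labels.
outputs = torch.tensor([[ 1.2, -0.3, 0.0],
                        [-2.0,  0.7, 4.0]])
# Float targets with the same shape as the logits.
labels = torch.tensor([[1.0, 0.0, 0.0],
                       [0.0, 1.0, 1.0]])

loss = nn.BCEWithLogitsLoss()(outputs, labels)  # mean over all samples and labels

prediction_class = (outputs > 0.0).long()   # each label thresholded independently
prediction_probability = outputs.sigmoid()  # per-label probabilities
```

Unlike softmax, the sigmoid probabilities in a row are independent of each other and need not sum to 1.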

One other thing we came across that is worth noting: the loss function usually needs the model output and the labels to have matching sizes. So if the output layer has a size of, for example, 1000 (like ShuffleNet), then for training X labels the model output needs to be cropped to match the number of labels before passing it into the loss function.
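A possible alternative to cropping, sketched here under assumptions, is to swap the final layer for one with the required number of outputs. `TinyNet` below is a hypothetical stand-in for a backbone with a 1000-way head stored in an attribute called `fc`; the attribute name in your own model may differ, so check its definition:

```python
import torch
import torch.nn as nn

class TinyNet(nn.Module):
    """Hypothetical stand-in for a backbone with a 1000-way `fc` head."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 8, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
        )
        self.fc = nn.Linear(8, 1000)

    def forward(self, x):
        return self.fc(self.features(x))

n_labels = 5  # assumed number of labels, for illustration only
model = TinyNet()
# Replace the 1000-way head with one sized to the label count.
model.fc = nn.Linear(model.fc.in_features, n_labels)

outputs = model(torch.randn(2, 3, 32, 32))           # shape [2, n_labels]
labels = torch.randint(0, 2, (2, n_labels)).float()  # random multi-label targets
loss = nn.BCEWithLogitsLoss()(outputs, labels)
```

With this approach the output shape matches the labels directly, and no unused output units take part in training.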