Difference between Multilabel-multiclass classification and multitask classification

Dear computer vision community,

I have to find a suitable model for recognizing different events within video clips. I am unsure whether to treat my task as a multilabel-multiclass classification problem or as a multitask classification problem, and how it should be implemented.

Explanation of the task: From a video clip I need to know if :

  1. an Event A occurred or not

  2. was it successful, not successful, or out

  3. was it performed using right or left hand

  4. Which type of event was it (there are 6 types)

  5. was it done while running, walking, jumping or standing-up

So 5 labels need to be returned, and each returned label takes exactly one of its subclasses.

  1. Should 5 different output layers for 5 different outputs be implemented? Each one of them has n_classes = subclasses

self.fc1 = nn.Linear(block_inplanes[3] * block.expansion, n_classes1)

self.fc2 = nn.Linear(block_inplanes[3] * block.expansion, n_classes2)

self.fc3 = nn.Linear(block_inplanes[3] * block.expansion, n_classes3)

self.fc4 = nn.Linear(block_inplanes[3] * block.expansion, n_classes4)

self.fc5 = nn.Linear(block_inplanes[3] * block.expansion, n_classes5)

or

  2. Should I have only one output layer?

self.fc = nn.Linear(block_inplanes[3] * block.expansion, n_classes)

If the 2nd one is the right answer, how could this be implemented?

I would be very thankful for your precious help.

Hi,

Maybe to be more precise in my question: suppose I have a multiclass target with classes (A, B, C, D, E), and each class has subclasses, for example:

Class “A” can be 0 or 1
Class “B” can be 2, 3 or 4
Class “C” can be 5 or 6
Class “D” can be 7,8,9,10,11 or 12
Class “E” can be 13,14, 15 or 16

All information is extracted from the same scene and object
…

Can this task be solved using a one-hot or a multi-hot encoding? Or is this a multitask classification problem where I should use multiple outputs, calculate the loss for each output independently, and then sum them together…?

Thank you.

Hi JB!

Could you clarify your use case?

First, let’s temporarily ignore the “subclasses” issue, and just look at
“classes.”

Second, I will purposely speak in the language of two-dimensional
still images rather than “three-dimensional” videos. There is a clean
analogy between the two, so, to focus the discussion, please describe
your use case in the still-image setting.

Let a single sample (which would typically be part of a batch consisting
of one or more independent samples) be a single still image.

Do you wish to perform (single-label), multi-class classification, e.g.,
this is an image of a cat or a pigeon or a shark (but not more than
one at a time)?

Do you wish to perform multi-label, multi-class classification, e.g.,
this image contains a cat, does not contain a pigeon, and does
contain a shark?

Do you wish to perform object detection, that is, this image contains
one cat in bounding-box A, three pigeons, one each in bounding-boxes
B, C, and D, and no sharks?

(Other possibilities would include semantic segmentation and instance
segmentation.)

Now, about your “subclasses:” What is the substantive meaning of
a class vs. a subclass? Why don’t we just say that you have 17
different classes (that happen to be labelled “A0”, “A1”, “B2”, “B3”,
“B4”, …, “E13”, “E14”, “E15”, and “E16”)?

Best.

K. Frank

Hi Frank, thank you very much for your detailed answer. In my use case I am trying to classify player action within a handball video. I want to know if the player performed a pass with some characteristics but as you said let’s talk about still images and take one image.

Let’s assume that I want to classify an image and want to know for each single image if:

  • it contains a car or not
  • which color the car has (red/black/white/blue)
  • which type of car it is (Berline/4x4/SUV/Coupe)
  • whether there is a conductor within the car or not

About the subclasses:
I thought about considering them as different classes (13 in the example above). It would then be treated as a multilabel classification, where several classes can be active at the same time. But I cannot have, for example:

  • SUV and Berline as car type at the same time.
  • Or more than one color at the same time …

I hope I have explained the use case well.

Thanks a lot for your time.

Hi JB!

Continuing with the still-image example use case you gave, I can
see two approaches:

The first approach is a little simpler, but probably wouldn’t be my
choice. (It should work, though, and is probably worth a try.)

An image can either be a “not-car” or it can be exactly one of 32
kinds of “car.” That is, there are 4 choices for color, 4 choices for
type, and 2 choices for conductor, so 4 x 4 x 2 = 32. (If you knew
for a fact that you would never have a white SUV you would,
barring other constraints, only have 30 kinds of “car,” but let’s go
with the full 32.)

Therefore you have 33 classes. So treat this as a straightforward
single-label, 33-class classification problem. (Your last layer would
likely be a fully-connected Linear with out_features = 33 and
you would likely use CrossEntropyLoss as your loss function.)
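
To make the 33-class idea concrete, here is a minimal sketch of how the structured labels could be packed into a single class index and unpacked again. The names COLORS, TYPES, NOT_CAR, encode and decode are hypothetical helpers invented for this illustration, and the index layout (class 0 for "not-car", classes 1 to 32 for the car variants) is just one possible convention.

```python
# Hypothetical label vocabularies, taken from the car example in this thread.
COLORS = ["red", "black", "white", "blue"]
TYPES = ["Berline", "4x4", "SUV", "Coupe"]

NOT_CAR = 0  # assumption: class 0 is "not-car", classes 1..32 are car variants


def encode(is_car, color=None, car_type=None, conductor=False):
    """Map the structured labels to one class index in range(33)."""
    if not is_car:
        return NOT_CAR
    c = COLORS.index(color)
    t = TYPES.index(car_type)
    # 4 colors x 4 types x 2 conductor states = 32 combinations, shifted by 1.
    return 1 + (c * len(TYPES) + t) * 2 + int(conductor)


def decode(index):
    """Inverse of encode: recover (is_car, color, car_type, conductor)."""
    if index == NOT_CAR:
        return (False, None, None, None)
    rest, conductor = divmod(index - 1, 2)
    c, t = divmod(rest, len(TYPES))
    return (True, COLORS[c], TYPES[t], bool(conductor))
```

The encoded index would be the integer target passed to CrossEntropyLoss, and decode would be applied to the argmax of the 33 logits at inference time.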

But your classes have substantive structure (they’re a “product” of
color and type and conductor) and this is structure that your model
would need to “learn” in the 33-class approach. You might well get
better results if your model architecture has (some of) this structure
built in.

The second approach takes this structure of your classes into account
(but has the disadvantage that if your classes are not formed by the
full product of your subclasses – say, for example, you never have a
white SUV – there is no natural way to build this in, so the model will
have to “learn” this constraint).

The network will be, in essence, four separate classifiers that share
almost all of the same upstream processing and features.

In its simplest version, you would have a final Linear layer with
out_features = 10.

Suppressing the batch index, output[0] would be your prediction
for “car” / “not-car” (understood to be a single binary-prediction logit,
most likely fed to BCEWithLogitsLoss as its loss function).

output[1:5] would be the 4-class prediction for color and
output[5:9] would be the 4-class prediction for type, both
fed to CrossEntropyLoss, and output[9] would be the binary
prediction for “conductor” / “not-conductor”, fed to BCEWithLogitsLoss.

If the target (the ground-truth label) for “car” / “not-car” is “not-car”
you ignore the predictions for color, type, and conductor, skipping
(or multiplying by zero) the computations for their loss functions, and
just use the “car” / “not-car” BCEWithLogitsLoss as the total loss
function.

If the target is “car,” you compute all four loss functions and use
a weighted sum of them as your total loss function. (The weights
will be non-trainable hyperparameters, tuned by you, based on how
things train, how your results look, and on the relative importance
(to you) of getting, say, the color right vs. getting the type right.)
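
The masked, weighted loss described above could be sketched roughly as follows. The function name multitask_loss and its argument names are hypothetical, the default weights of 1.0 are placeholders to be tuned, and averaging the color / type / conductor terms over only the "car" samples in the batch is an assumption of this sketch rather than something prescribed above.

```python
import torch
import torch.nn.functional as F


def multitask_loss(output, car, color, car_type, conductor,
                   weights=(1.0, 1.0, 1.0, 1.0)):
    """Weighted sum of the four losses, ignoring attributes of non-cars.

    output:           (batch, 10) logits from the final Linear layer
    car, conductor:   float tensors of 0.0 / 1.0 with shape (batch,)
    color, car_type:  long tensors with values in 0..3, shape (batch,)
                      (any valid index works as a dummy for "not-car" samples)
    """
    w_car, w_color, w_type, w_cond = weights

    # "car" / "not-car" is scored for every sample.
    loss_car = F.binary_cross_entropy_with_logits(output[:, 0], car)

    # Per-sample losses for the remaining three subproblems.
    loss_color = F.cross_entropy(output[:, 1:5], color, reduction="none")
    loss_type = F.cross_entropy(output[:, 5:9], car_type, reduction="none")
    loss_cond = F.binary_cross_entropy_with_logits(
        output[:, 9], conductor, reduction="none")

    # Multiply by zero wherever the target is "not-car".
    masked = (w_color * loss_color + w_type * loss_type
              + w_cond * loss_cond) * car
    n_cars = car.sum().clamp(min=1.0)  # avoid dividing by zero

    return w_car * loss_car + masked.sum() / n_cars
```

With a batch that contains no cars, the result reduces to the "car" / "not-car" BCEWithLogitsLoss alone, and no gradient flows into output[:, 1:].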

You could also break your final Linear layer up into four separate
“heads,” one for each of your four classification subproblems, all fed
by the same upstream “feature vector.” If your four heads are all just
Linear layers with out_features = 1, 4, 4, 1, respectively,
this is mathematically no different than one single Linear layer
with out_features = 10. Using four Linear-layer heads does
have the advantage, however, that you can use different optimizer
hyperparameters (e.g., different learning rates) for the four heads.

Note that separating your final “layer” into four heads has the
potentially substantive advantage that your four heads can have
different architectures. For example, maybe a single Linear (with
out_features = 1) gives good-enough results for “car” / “not-car,”
but a sequence of two Linear layers (with the final Linear having
out_features = 4) turns out to be helpful in getting good results
for your type class.
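
As an illustration of the separate-heads variant, here is a minimal sketch with a deeper head for the type prediction and per-head learning rates set through optimizer parameter groups. The class name FourHeads, the feature size of 512, the hidden size of 64, and all learning rates are made-up values for this example. The ReLU between the two Linear layers is also an addition of this sketch: without a nonlinearity, two stacked Linear layers collapse to a single linear map.

```python
import torch
from torch import nn


class FourHeads(nn.Module):
    """Four classification heads fed by the same upstream feature vector."""

    def __init__(self, in_features, hidden=64):
        super().__init__()
        self.car = nn.Linear(in_features, 1)        # car / not-car logit
        self.color = nn.Linear(in_features, 4)      # 4-class color logits
        self.car_type = nn.Sequential(              # deeper head for type
            nn.Linear(in_features, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 4),
        )
        self.conductor = nn.Linear(in_features, 1)  # conductor logit

    def forward(self, features):
        # Concatenate so the layout matches the out_features = 10 version:
        # [car, color x 4, type x 4, conductor]
        return torch.cat(
            [self.car(features), self.color(features),
             self.car_type(features), self.conductor(features)],
            dim=1,
        )


heads = FourHeads(in_features=512)

# One parameter group per head, each with its own (illustrative) learning rate.
optimizer = torch.optim.SGD(
    [
        {"params": heads.car.parameters(), "lr": 1e-3},
        {"params": heads.color.parameters(), "lr": 1e-2},
        {"params": heads.car_type.parameters(), "lr": 5e-3},
        {"params": heads.conductor.parameters(), "lr": 1e-3},
    ],
    lr=1e-2,
    momentum=0.9,
)
```

In a full model the backbone parameters would go into a further parameter group; they are left out here to keep the sketch short.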

Best.

K. Frank


The proposed solutions are really promising. I tried the approach with four classification subproblems. It worked, but the results were not that satisfying, so I have to make some adjustments and try again.

I will also try the other proposed approaches and I think you got the point. I will give you feedback once I have the results.

I am very thankful for your detailed and very constructive response.