Multiple multi-class classifiers in one network

Hello everyone,
I have a dataset where each sample needs to be classified along 5 outputs, and each output takes exactly one value between 0 and 4. For example, a possible target could be [0, 1, 1, 3, 2].
I already have a working regression model on the original continuous labels for the 5 outputs (each in the [0, 1] range), which I have now binned to turn the problem into a classification one, to see if I can reach better performance while handling the class imbalance introduced by the binning. Basically, I want to see if simplifying the problem to a few discrete classes can improve performance.
Do you think I should treat this problem as 5 different multi-class classifiers? In that case, is there a way to write the network so I don't have to run the same network 5 times?
Or should I instead one-hot encode each of the 5 outputs and go for standard classification, post-processing the network output by splitting it according to the 5 outputs and then taking the argmax, or even using the binary_crossentropy loss? Or should I think of the problem as multi-label classification?
These are quite a few ideas; what would work best in your experience?

You could create 5 different "heads" in your model, where each head just uses a small classifier (e.g. a single linear layer) to predict one of the 5 outputs.
This approach would reuse the complete backbone and just run all heads separately.
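Something along these lines, as a minimal sketch (the MultiHeadNet name and all layer sizes are made up for illustration, not taken from your setup):

```python
import torch
import torch.nn as nn

class MultiHeadNet(nn.Module):
    def __init__(self, in_features=128, hidden=64, num_groups=5, num_classes=5):
        super().__init__()
        # shared backbone, executed only once per batch
        self.backbone = nn.Sequential(
            nn.Linear(in_features, hidden),
            nn.ReLU(),
        )
        # one small classifier head per output group
        self.heads = nn.ModuleList(
            nn.Linear(hidden, num_classes) for _ in range(num_groups)
        )

    def forward(self, x):
        features = self.backbone(x)
        # a list of 5 logit tensors, each of shape [batch, num_classes]
        return [head(features) for head in self.heads]
```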

I’m not sure I understand this approach. Could you explain it a bit more?

This might work, although the constraint that only 5 of the outputs (one per group) can be active at any time would be lost.
Still a valid approach and worth a try in my opinion.

Of course. The idea here would be to concatenate the one-hot encodings of each of the 5 outputs, so that the network has 25 outputs/targets, 5 for each of the 5 groups. I would then split this 25-element array into 5 sub-arrays, assuming there is exactly one hot entry in each, and use the argmax to get the predicted class per group.
Does that make sense? In this case I wouldn't need 5 different heads, but I'm not sure what loss would be appropriate: from the documentation it is not entirely clear whether cross_entropy can still work, or whether I should use something like MultiLabelMarginLoss, since other losses require the target to be between 0 and 1.
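In code, the post-processing I have in mind would be roughly this (batch size and logits are just placeholders standing in for the real network output):

```python
import torch

batch, num_groups, num_classes = 8, 5, 5

# stand-in for the 25 raw outputs of the network for one batch
logits = torch.randn(batch, num_groups * num_classes)

# split the 25 values into 5 sub-arrays of 5 and take the argmax
# in each sub-array as the predicted class of that group
preds = logits.view(batch, num_groups, num_classes).argmax(dim=2)
print(preds.shape)  # torch.Size([8, 5]): one label in [0, 4] per group
```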

In this case I guess I will need 5 different losses(?)

You would calculate 5 different losses, which you could average into a single one to call .backward() on.
Actually both approaches might be quite similar.
Either you split your last layer(s) into 5 different heads, each with its own loss, or you split the large layer (25 outputs) into 5 x 5 snippets to calculate the losses separately.
While I've used my proposal in the past, I've never seen yours in the wild, so it would be interesting to see if it works.

EDIT: After thinking about it a bit more, both approaches should be mathematically equivalent.
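To make the second variant concrete, here is a minimal sketch (tensor shapes are placeholders) of splitting the 25 outputs into 5 x 5 snippets, computing the 5 losses separately, and averaging them into a single scalar for .backward():

```python
import torch
import torch.nn.functional as F

batch, num_groups, num_classes = 8, 5, 5

# stand-in for the output of a final Linear(hidden, 25) layer and its targets
logits = torch.randn(batch, num_groups * num_classes, requires_grad=True)
target = torch.randint(0, num_classes, (batch, num_groups))

# split the 25 outputs into 5 x 5 snippets and compute one loss per group
grouped = logits.view(batch, num_groups, num_classes)
losses = [F.cross_entropy(grouped[:, i, :], target[:, i])
          for i in range(num_groups)]

# average the 5 losses into a single scalar and call .backward() once
loss = torch.stack(losses).mean()
loss.backward()

# equivalently, F.cross_entropy accepts a [batch, classes, groups] tensor
# directly, so this yields the same value as the average above (with the
# default 'mean' reduction), matching the note that both are equivalent
same_loss = F.cross_entropy(grouped.transpose(1, 2), target)
```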

Yes, that's what I thought: averaging the losses of the 5 different heads in that case.

We already use a similar approach of one-hot encoding, then splitting and computing the individual losses, and it works pretty well, I would say.

I was curious to hear if there were alternatives to this, or a better/more optimized way to code the whole thing.

Thanks a lot for the prompt replies, I really appreciate your help!!