Hi everyone,
I am working on a classification problem where the outcome contains more than one categorical variable, each with multiple classes. For example, y could be [y1, y2, y3, y4, y5], where each y_i is a categorical variable with multiple classes.

The only solution I could think of is to have a separate output layer (softmax) for each categorical variable. This is very tedious. Is there an easier way to deal with this type of problem? Thank you!

Your use case sounds like a multi-label classification, where each sample might belong to zero, one or multiple classes.
If that’s the case, you could use a final linear layer with out_features=nb_classes, use a multi-hot encoded target in the same shape (one 0/1 entry per class, as floats), and apply nn.BCEWithLogitsLoss as your loss function.
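
A minimal sketch of that multi-label setup, in case it helps. The sizes, the random input, and the target values are made up for illustration; only the final linear layer, the same-shape target, and nn.BCEWithLogitsLoss come from the suggestion above.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Illustrative sizes (assumptions, not from the thread).
n_batch, n_features, nb_classes = 4, 16, 5

# Final linear layer with one logit per class; no sigmoid is applied here.
final_layer = nn.Linear(n_features, nb_classes)

x = torch.randn(n_batch, n_features)

# Float target with the same shape as the logits.
# Each row may contain zero, one, or several 1s.
target = torch.tensor([[1., 0., 0., 1., 0.],
                       [0., 0., 0., 0., 0.],
                       [0., 1., 1., 1., 0.],
                       [0., 0., 0., 0., 1.]])

logits = final_layer(x)             # shape [n_batch, nb_classes]
criterion = nn.BCEWithLogitsLoss()  # applies the sigmoid internally
loss = criterion(logits, target)
loss.backward()
```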

I don’t see the use case you describe as being a multi-label,
multi-class problem. I guess I would call it something like
“multiple multi-class.”

That is (and to make things simpler, let me speak of just y_1
and y_2), let’s say y_1 indicates one of three colors
(0 = red, 1 = green, 2 = blue), and that y_2 indicates
one of four animals (0 = parrot, 1 = cat, 2 = cobra, 3 = shark). You classify images according to y_1 and y_2.
So each image is given exactly one color label and exactly one
animal label. Therefore this is a set of two single-label,
multi-class classification problems (running the same network
on the same data).

Is this what you mean?

If my description above represents your use case, I don’t think
that there is anything entirely built in to pytorch that will do
what you want. But doing it in pieces should be straightforward.

For the y_1, y_2 example, I would do something like this:

Let your final layer have 7 outputs (3 for y_1 and 4 for y_2).
Don’t pass it through any kind of softmax(). Let your target
data be a pair of integer categorical class labels, the first
for y_1 that runs over {0, 1, 2}, and the second for y_2
that runs over {0, 1, 2, 3}.

The output of your model, pred, will have shape [nBatch, 7]
and your batch of targets, targ, will have shape [nBatch, 2].
Then calculate:
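
A sketch of that calculation, using the names from this post (pred, targ, loss_y_1, loss_y_2). The random tensors are stand-ins for a real model output and real labels, and adding the two losses with equal weight is an assumption:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
n_batch = 8

# Stand-in for the model output: 3 logits for y_1 followed by 4 logits for y_2.
pred = torch.randn(n_batch, 7, requires_grad=True)

# Integer class labels: column 0 runs over {0, 1, 2}, column 1 over {0, 1, 2, 3}.
targ = torch.stack([torch.randint(0, 3, (n_batch,)),
                    torch.randint(0, 4, (n_batch,))], dim=1)  # shape [n_batch, 2]

# Slice the logits per variable and pass them, unnormalized, to cross_entropy().
loss_y_1 = F.cross_entropy(pred[:, :3], targ[:, 0])
loss_y_2 = F.cross_entropy(pred[:, 3:], targ[:, 1])

# Equal weighting of the two losses is an assumption.
loss = loss_y_1 + loss_y_2
loss.backward()
```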

(Remember, cross_entropy() has, in effect, softmax() built in, so it
expects raw logits. It might better have been called
cross_entropy_with_logits().)

As an aside, you could – mathematically equivalently – use two
final output layers.

Let’s say your second-to-last layer has 32 outputs. Then you
could feed these outputs to both a Linear(32, 3) and a Linear(32, 4). Then use the outputs of these two Linear
layers to calculate loss_y_1 and loss_y_2, as above.
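
A rough sketch of the two-layer variant. The class name TwoHeadNet, the input size of 10, and the one-layer body are hypothetical; only the 32 outputs feeding a Linear(32, 3) and a Linear(32, 4) come from the description above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TwoHeadNet(nn.Module):  # hypothetical name, for illustration only
    def __init__(self, n_in=10, n_hidden=32):
        super().__init__()
        # Shared part of the network; its last layer has 32 outputs.
        self.body = nn.Sequential(nn.Linear(n_in, n_hidden), nn.ReLU())
        self.head_y_1 = nn.Linear(n_hidden, 3)  # logits for y_1
        self.head_y_2 = nn.Linear(n_hidden, 4)  # logits for y_2

    def forward(self, x):
        h = self.body(x)
        return self.head_y_1(h), self.head_y_2(h)


torch.manual_seed(0)
model = TwoHeadNet()
x = torch.randn(8, 10)
targ = torch.stack([torch.randint(0, 3, (8,)),
                    torch.randint(0, 4, (8,))], dim=1)

pred_y_1, pred_y_2 = model(x)
loss_y_1 = F.cross_entropy(pred_y_1, targ[:, 0])
loss_y_2 = F.cross_entropy(pred_y_2, targ[:, 1])

# Summing the losses lets a single backward() call cover both heads.
loss = loss_y_1 + loss_y_2
loss.backward()
```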

But, again, there is no* mathematical difference between these
two approaches. To me, the first approach of using a single final
layer with 7 outputs seems a little cleaner and less error-prone.

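*) The one possible exception is initialization. A scheme that depends
on out_features (Xavier, for example) would scale the random initial
weights differently. As far as I know, the default initialization of
Linear depends only on in_features, so by default the two approaches
start from the same distribution.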

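Note that how many times you backpropagate depends on how you combine
the losses, not on how many final layers you have. If you sum loss_y_1
and loss_y_2 and call backward() once on the total, you backpropagate
through the network once; if you call backward() on each loss
separately, you backpropagate twice. Backpropagating once saves on
computation, and these savings can be significant.

Also, if you backpropagate twice, you will have to pass
retain_graph=True to the first backward() call so that the shared part
of the graph is still available for the second.

I can’t think of a use case where I would prefer backpropagating twice.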
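
A small sketch comparing one backward pass over the summed loss with two separate backward passes. The toy model and random data are assumptions made for illustration. Both routes should accumulate the same gradients, but the second needs retain_graph=True on its first backward() call.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Toy network with a single 7-output final layer (illustrative sizes).
model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 7))
x = torch.randn(8, 10)
targ = torch.stack([torch.randint(0, 3, (8,)),
                    torch.randint(0, 4, (8,))], dim=1)


def losses():
    # Fresh forward pass, returning the two per-variable losses.
    pred = model(x)
    return (F.cross_entropy(pred[:, :3], targ[:, 0]),
            F.cross_entropy(pred[:, 3:], targ[:, 1]))


# One backward pass over the summed loss.
model.zero_grad()
loss_y_1, loss_y_2 = losses()
(loss_y_1 + loss_y_2).backward()
grad_once = model[0].weight.grad.clone()

# Two backward passes; retain_graph=True keeps the shared graph alive
# for the second call, and gradients accumulate across the two calls.
model.zero_grad()
loss_y_1, loss_y_2 = losses()
loss_y_1.backward(retain_graph=True)
loss_y_2.backward()
grad_twice = model[0].weight.grad.clone()

same = torch.allclose(grad_once, grad_twice, atol=1e-6)
```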