Hi JB!
Continuing with the still-image example use case you gave, I can
see two approaches:
The first approach is a little simpler, but probably wouldn’t be my
choice. (It should work, though, and is probably worth a try.)
An image can either be a “not-car” or it can be exactly one of 32
kinds of “car.” That is, there are 4 choices for color, 4 choices for
type, and 2 choices for conductor, so 4 x 4 x 2 = 32. (If you knew
for a fact that you would never have a white SUV you would,
barring other constraints, only have 30 kinds of “car,” but let’s go
with the full 32.)
Therefore you have 33 classes. So treat this as a straightforward
single-label, 33-class classification problem. (Your last layer would
likely be a fully-connected Linear
with out_features = 33
and
you would likely use CrossEntropyLoss
as your loss function.)
But your classes have substantive structure (they’re a “product” of
color and type and conductor) and this is structure that your model
would need to “learn” in the 33-class approach. You might well get
better results if your model architecture has (some of) this structure
built in.
The second approach takes this structure of your classes into account
(but has the disadvantage that if your classes are not formed by the
full product of your subclasses – say, for example, you never have a
white SUV – there is no natural way to build this in, so the model will
have to “learn” this constraint).
The network will be, in essence, four separate classifiers that share
almost all of the same upstream processing and features.
In its simplest version, you would have a final Linear
layer with
out_features = 10
.
Suppressing the batch index, output[0]
would be your prediction
for “car” / “not-car” (understood to be a single binary-prediction logit,
most likely fed to BCEWithLogitsLoss
as its loss function).
output[1:5]
would be the 4-class prediction for color and
output[5:9]
would be the 4-class prediction for type, both
fed to CrossEntropyLoss
, and output[9]
would be the binary
prediction for “conductor” / “not-conductor”, fed to BCEWithLogitsLoss
.
If the target
(the ground-truth label) for “car” / “not-car” is “not-car”
you ignore the predictions for color, type, and conductor, skipping
(or multiplying by zero) the computations for their loss functions, and
just use the “car” / “not-car” BCEWithLogitsLoss
as the total loss
function.
If the target
is “car,” you compute all four loss functions and use
a weighted sum of them as your total loss function. (The weights
will be non-trainable hyperparameters, tuned by you, based on how
things train, how your results look, and on the relative importance
(to you) of getting, say, the color right vs. getting the type right.)
You could also break your final Linear
layer up into four separate
“heads,” one for each of your four classification subproblems, all fed
by the same upstream “feature vector.” If your four heads are all just
Linear
layers with out_features = 1, 4, 4, 1
, respectively,
this is mathematically no different than one single Linear
layer
with out_features = 10
. Using four Linear
-layer heads does
have the advantage, however, that you can use different optimizer
hyperparameters (e.g., different learning rates) for the four heads.
Note that separating your final “layer” into four heads has the
potentially substantive advantage that your four heads can have
different architectures. For example, maybe a single Linear
(with
out_features = 1
) gives good-enough results for “car” / “not-car,”
but a sequence of two Linear
layers (with the final Linear
having
out_features = 4
) turns out to be helpful in getting good results
for your type class.
Best.
K. Frank