Handling "Nones" in multilabel classification

shartzog · September 17, 2020, 4:53pm

I’m working on a multilabel classification problem. In my current version, I have four potential labels: “Hospitalized”, “Intubated”, “Deceased”, and “Pneumonia”. My model trains well and has provided some interesting insights on cases with at least one label, but none of my post-training analyses account for the cases that have NO labels, i.e. patients that contracted the disease but recovered at home with no complications.

I started trying to modify my post-training analysis routines to add in a 5th “None” label, but quickly realized this wasn’t going to work. What was my predicted likelihood of “None”? Is it (1 - [sum of other label likelihoods])? Not really… This value is often negative. At that point, I considered adding in a 5th “None” label before training and revising my multihot target matrices accordingly, but I wasn’t sure that was the right way to go either given that the truth value of that 5th label can be easily calculated given the truth values for the other four labels.
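To see why that subtraction goes negative, here is a minimal sketch. It assumes the model emits one independent sigmoid output per label (the usual multilabel setup); the logit values are made up for illustration.

```python
import math

def sigmoid(x):
    """Map a raw logit to a value in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

# Hypothetical logits for one patient, in the order:
# Hospitalized, Intubated, Deceased, Pneumonia
logits = [1.2, -0.4, 0.3, 0.9]
probs = [sigmoid(z) for z in logits]

# Each output is a separate per-label likelihood, so the four values
# are not constrained to sum to 1 and the "remainder" can be negative.
naive_none = 1.0 - sum(probs)

print([round(p, 3) for p in probs])
print(round(naive_none, 3))
```

The per-label outputs only sum to 1 when they come from a softmax, which would turn this into a single-label problem.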

So my questions are twofold:

Is adding in the 5th “None” label prior to training the right way forward?
How do interdependencies between labels (i.e. label E is always 0 if label A, B, C, or D is 1 and always 1 otherwise) impact multilabel classification problems?

KFrank · September 17, 2020, 9:41pm

Hi Sam!

You do not need (or want) a “None” label. The “None” case is indicated
by none of your four given labels being active.

You have a multi-label, multi-class classification problem. It is
multi-class because you have four classes, “Hospitalized,” “Intubated,”
“Deceased,” and “Pneumonia.” It is multi-label because any number,
including all and none, of the labels can be active for any sample.

You say your model trains well. I’m assuming that it trains well for
all label combinations, including no active labels (and, hopefully, you
have some no-active-label samples in your training data). Provided
that this is the case, your model is fine, and, if you have an issue with
the “None” case in your post-training analysis, you need to fix it there,
rather than trying to tweak your model to sweep the issue under the rug.
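As a small illustration of the point above, the sketch below builds multi-hot targets in which the “None” case is simply the all-zero row. The helper names (`to_multihot`, `is_none_case`) and the sample rows are hypothetical.

```python
LABELS = ["Hospitalized", "Intubated", "Deceased", "Pneumonia"]

def to_multihot(active):
    """Encode a set of active label names as a 0/1 vector over LABELS."""
    return [1.0 if name in active else 0.0 for name in LABELS]

def is_none_case(row):
    """The "None" case needs no label of its own: no entry is active."""
    return not any(row)

targets = [
    to_multihot({"Hospitalized", "Pneumonia"}),
    to_multihot({"Hospitalized", "Intubated", "Deceased"}),
    to_multihot(set()),  # no active labels
]

for row in targets:
    print(row, "None case" if is_none_case(row) else "")
```

Targets of this shape are what a per-label binary cross-entropy loss such as `torch.nn.BCEWithLogitsLoss` expects.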

Best.

K. Frank

shartzog · September 18, 2020, 12:24am

@KFrank
Thank you for your feedback! That was my initial reaction as well, but between posting this and reading your response, I went ahead and tried training with a None label added. The new model seemed to do quite well. The accuracy of my other labels seemed improved if anything (at least as far as I’m able to interpret accuracy at this point). Maybe I’m just thinking about it wrong since, technically, that last label is “Recovered at Home” (which I DO actually want to be able to predict) rather than “None”, but that still begs the question about how label interdependencies might influence model results. (NOTE: I realized after posting that there’s another obvious inter-label relation present in my use case, namely you won’t be intubated if you’re not hospitalized.)

My use case is also, to put it mildly, atypical. In the end, the likelihoods themselves are the critical output, not a hard label prediction. I still plan to do some threshold analysis and possibly renormalize my likelihoods, but the ultimate goal is to have the ability to predict “risk of an outcome” rather than the outcome itself. Does any of that change anything? Or would you still recommend removing the final label and trying to figure out a method to estimate the “Recovered at Home” aka “None” outcome after the fact?

Thanks again for taking the time!
–SEH

KFrank · September 18, 2020, 1:38am

Hi Sam!

As a side note, for the “None” case to actually be “Recovered at Home,”
there must be some unstated assumptions in play, namely, that
pneumonia always leads to hospitalization, and that the person was
indeed sick and either died or recovered. (Otherwise, cases like
“Never Sick,” “Still Sick at Home (but not with Pneumonia)” would be
consistent with the “None” case, and “Recovered at Home (from
Pneumonia)” would not be consistent with “None.”)

My general view is that if you have uniformly-structured “label
interdependencies,” you should look for a way to build them into
your model. A clear example is that if you know that exactly one
of your labels will be active for each sample, then you recognize
that you are actually working with a single-label, multi-class problem,
and you model it as such (rather than as a multi-label problem).
This way you make it easier for your training because you have
“told” your model that there will be exactly one label, rather than
having your model “learn” this fact.

But if your label interdependencies are, so to speak, “a little of this,
a little of that,” then you’re probably better off letting your model
learn those interdependencies through the training process, rather
than developing some complicated scheme to build them into your
model.

As for the relation you noted between “Intubated” and “Hospitalized”:
Assuming that all the other labels can be present or not, independent
of one another – that is, that this interdependency is not part of a larger
uniform structure – I would consider it appropriate for your model to
learn this. (And I would assume that your training data contains many
samples with “Not Hospitalized and Not Intubated,” “Hospitalized and
Not Intubated,” and “Hospitalized and Intubated,” but no samples with
“Not Hospitalized and Intubated,” so that your model can, in fact, learn
it.)
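A quick way to check that assumption is to count label combinations in the training targets. This is a minimal sketch with made-up rows; the column order is assumed to be Hospitalized, Intubated, Deceased, Pneumonia.

```python
from collections import Counter

# Made-up multi-hot target rows:
# (Hospitalized, Intubated, Deceased, Pneumonia)
rows = [
    (0, 0, 0, 0),
    (1, 0, 0, 1),
    (1, 1, 1, 1),
    (1, 1, 0, 0),
    (0, 0, 1, 0),
    (0, 0, 0, 1),
]

# Count each (Hospitalized, Intubated) combination.
combo_counts = Counter((hosp, intub) for hosp, intub, _, _ in rows)

# "Not Hospitalized and Intubated" should never occur.
violations = combo_counts[(0, 1)]

print(dict(combo_counts))
print("Not Hospitalized and Intubated:", violations)
```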

If your training data is labelled with probabilities, then you can sensibly
train your model to predict probabilities. But – to my mind, at least – if
your labels are hard labels (e.g., 0 or 1, but not 0.625), then I view the
probabilistic predictions (e.g., logits from a final Linear layer) made
by your model to be more an uncertainty about what the model has
successfully learned, rather than a prediction of a probability. But
perhaps such a distinction doesn’t matter in practice, or perhaps it’s
purely semantic.

Best.

K. Frank

shartzog · September 18, 2020, 1:59am

@KFrank
Great points re the ability to interpret “None” as “recovered at home”! Now that you mention it, I KNOW there are examples in my dataset in which Pneumonia was indicated AND the patient still recovered at home!

As for confounding factors related to cases that have not yet “concluded”, I controlled for those by only considering cases where the onset of symptoms was more than a month in the past. That has its own set of issues (e.g. it won’t reflect improvements in treatment in real time), but at least 99+% of cases will be resolved one way or the other by then.

I think my path forward will be to add in the 5th “HomeRecovery” label without excluding pneumonia cases. I’d also love to be able to capture the “Never Sick” vs. “got sick but still recovered at home” distinction, but it’s just not present in the data as far as I can figure.

Finally, to your last point, I realize I’m out on a limb here by trying to interpret my outputs as probabilities, but… That’s exactly what I’d like to be able to do. Hence my struggles interpreting accuracy, I guess…

Regardless, I really appreciate your insights!!! Thanks a ton!
–SEH

EDIT: Changed “solution” to be your first response rather than my initial reaction to mark the second. The first response will likely be more valuable to users with a more typical use case.

KFrank · September 18, 2020, 4:00am

Hi Sam!

Well, I shouldn’t pretend to have more than a superficial understanding
of your use case, however …

Let me make a few more comments.

I think you’re overly fixated on adding another label. I think the
multi-label approach is your friend, and is already doing the necessary
work for you.

So, as I understand it, in all of your samples the person has been
sick (“onset of symptoms”), and, as an approximation, has either
died or (approximately) recovered (“more than a month in the past”).
So let’s take those as the rules of the game.

Under these assumptions, it seems to me that your “HomeRecovery”
label is exactly equivalent to “Not Hospitalized” and “Not Deceased.”
Again, I think multi-label is your friend.

Yes, if the reality of your data is that (to a good approximation) each
person has been sick, and (to a good approximation) you don’t have
any “Never Sick” samples in your data, then you won’t be able to
capture a “Never Sick” vs. “got sick” distinction.

To recap: From what you’ve said so far, it seems to me that the four
labels you discussed originally are just the set you need to capture
all of the cases in your data. Adding more labels – be they extraneous,
conflicting, or redundant – will, as a general rule, just muddy the waters
and make your model harder to train.

Best.

K. Frank

shartzog · September 18, 2020, 3:59pm

@KFrank
Thanks again for the thoughtful response!

Point taken. How, then, should I calculate a “HomeRecovery” likelihood after the fact? I can’t really assume independence between “Hospitalized” and “Deceased”, so simple methods e.g.

P(HomeRec) = 1 - P(Hosp U Deceased)

won’t work. Or am I thinking about it wrong?

EDIT: Technically, I realize I AM, in many ways, “thinking about it wrong”, not least by interpreting the outputs of my model as probabilities, but those reservations aside, any idea how the math should work?

Thanks again!
–SEH

KFrank · September 18, 2020, 7:33pm

Hi Sam!

I’m not really sure what you are asking, but from the perspective of
your dataset, you can simply count:

Empirically observed probability = number of samples having both
“Not Hospitalized” and “Not Deceased” divided by total number of
samples.

As for your expression P(HomeRec) = 1 - P(Hosp U Deceased) not
working: On the contrary. If I interpret the U as meaning “union” or
“or,” this is correct.

Using the trivial identity:

1 = total_samples / total_samples

your expression becomes:

P(HomeRec) = (total_samples - samples_with_Hosp_or_Deceased) / total_samples
           = samples_with_Not_Hosp_and_Not_Deceased / total_samples

This doesn’t assume that “Hospitalized” and “Deceased” are
independent. Indeed, the “union” (your U) is what accounts for any
interdependency.
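The counting argument can be checked directly. The sketch below uses ten made-up (Hospitalized, Deceased) pairs in which the two labels are deliberately correlated.

```python
# Made-up (Hospitalized, Deceased) pairs for ten samples.
samples = [(0, 0)] * 5 + [(1, 0)] * 2 + [(1, 1)] * 2 + [(0, 1)]

total_samples = len(samples)
samples_with_hosp_or_deceased = sum(1 for h, d in samples if h or d)
samples_with_not_hosp_and_not_deceased = sum(
    1 for h, d in samples if not h and not d
)

# 1 - P(Hosp U Deceased), computed from the union count
p_home_from_union = 1 - samples_with_hosp_or_deceased / total_samples
# P(NotHosp and NotDeceased), counted directly
p_home_direct = samples_with_not_hosp_and_not_deceased / total_samples

print(p_home_from_union, p_home_direct)
```

Both routes give the same number because every sample is either in the union or in neither set, whatever the correlation between the labels.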

Let’s rewrite your probability in terms of an “intersection” (“and”):

P(HomeRec) = P(NotHosp and NotDeceased)

What you can’t do (without “Hospitalized” and “Deceased” being
probabilistically independent) is the following:

P(NotHosp and NotDeceased) = P(NotHosp) * P(NotDeceased)

That is, you can’t factor the joint probability into a product of two
individual probabilities unless the two conditions are independent.
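Here is a minimal numeric sketch of that failure, using ten made-up (Hospitalized, Deceased) pairs in which the labels are correlated.

```python
# Made-up (Hospitalized, Deceased) pairs; deaths cluster among the
# hospitalized, so the two labels are not independent.
samples = [(0, 0)] * 5 + [(1, 0)] * 2 + [(1, 1)] * 2 + [(0, 1)]
n = len(samples)

p_not_hosp = sum(1 for h, _ in samples if not h) / n
p_not_deceased = sum(1 for _, d in samples if not d) / n
p_joint = sum(1 for h, d in samples if not h and not d) / n

# The product only matches the joint probability under independence.
p_product = p_not_hosp * p_not_deceased

print("joint:  ", p_joint)
print("product:", round(p_product, 2))
```

With this data the joint frequency is 0.5, while the product of the two marginals is about 0.42.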

Best.

K. Frank

shartzog · September 19, 2020, 12:58am

Thanks again, @KFrank!
I especially appreciate the detail in your explanation of the statistics! I’m following how all of that would apply for the entire dataset, but I’d like to be able to produce an estimation of the likelihood of “HomeRecovery” for an individual sample. Given:

P(NotHosp and NotDeceased) != P(NotHosp) * P(NotDeceased)

as stated above and:

P(Hosp or Deceased) != P(Hosp) + P(Deceased)

since Hosp and Deceased are not mutually exclusive, I’m struggling to see how I can estimate P(HomeRec) for an individual sample. Wouldn’t I need some kind of sophisticated simulation technique?

I feel like I’m missing something simple, so, as always, I appreciate your advice!!!
–SEH

EDIT: To be clear, I’m using (incorrectly) probabilistic notation to represent model likelihood for a single sample, not “true probability”. Sorry for any confusion!

EDIT 2: Maybe my question would be better stated as “How can I estimate P(Hosp or Deceased) (or equivalently, P(NotHosp and NotDeceased)) for a single sample?”

KFrank · September 19, 2020, 9:19pm

Hi Sam!

The more I think about it, I do believe that your proposal to train on
a “HomeRec” label is the way to go.

A disclaimer: Because your data is not labeled with probabilities,
you’re not training your model to predict probabilities. Consider a
case where you overfit your model on your training data. You’ve
trained it to predict 0s and 1s, so you get mostly predictions along
the lines of 0.01 and 0.99 with a very low loss and very high
accuracy on your training set. But let’s say that you get a useful,
but middling accuracy of 75% on your validation set. Your model
(because it was overfit) has high confidence in its predictions (close
to 0.0 or 1.0), but these are not related to non-training-set
probabilities.
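One common way to test whether predicted values behave like probabilities is a reliability (calibration) table: bin the validation predictions and compare the mean prediction in each bin with the observed fraction of positives. The sketch below is an assumption-laden illustration, not part of the original discussion; `reliability_table` is a hypothetical helper and the validation data is made up to mimic the overfit, 75%-accurate model described above.

```python
def reliability_table(probs, labels, n_bins=5):
    """Return (bin_index, count, mean_prediction, observed_fraction)
    for every non-empty equal-width bin over [0, 1]."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, y))
    table = []
    for idx, items in enumerate(bins):
        if not items:
            continue
        mean_pred = sum(p for p, _ in items) / len(items)
        observed = sum(y for _, y in items) / len(items)
        table.append((idx, len(items), mean_pred, observed))
    return table

# Made-up validation set: confident predictions, right 3 times out of 4.
val_probs = [0.99] * 4 + [0.01] * 4
val_labels = [1, 1, 1, 0, 0, 0, 0, 1]

table = reliability_table(val_probs, val_labels)
for row in table:
    print(row)
```

A well-calibrated model would show mean predictions close to the observed fractions in every bin; here 0.99 is paired with 0.75 and 0.01 with 0.25.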

Back to your question:

You could add the additional “HomeRec” label to your multi-label
classification problem, train that, and interpret your “HomeRec”
prediction values (converted from logits to probabilities, as appropriate,
depending on the output of your model) as the probability of a single
sample being “HomeRec.” To tie this back to the earlier discussion,
we can think of the model “learning” how the correlated “Hospitalized”
and “Deceased” probabilities should be combined to produce the
“HomeRec” probability.
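Building the extra target column could look like the sketch below. It assumes the rule discussed earlier in the thread (HomeRec means Not Hospitalized and Not Deceased) and a column order of Hospitalized, Intubated, Deceased, Pneumonia; `add_home_rec` is a hypothetical helper.

```python
LABELS = ["Hospitalized", "Intubated", "Deceased", "Pneumonia"]
HOSP = LABELS.index("Hospitalized")
DECEASED = LABELS.index("Deceased")

def add_home_rec(row):
    """Append a fifth HomeRec target derived from the original four."""
    home_rec = 1 if row[HOSP] == 0 and row[DECEASED] == 0 else 0
    return list(row) + [home_rec]

rows = [
    [0, 0, 0, 1],  # pneumonia, recovered at home
    [1, 0, 0, 1],  # hospitalized with pneumonia
    [0, 0, 1, 0],  # died without being hospitalized
]
extended = [add_home_rec(r) for r in rows]
for r in extended:
    print(r)
```

Note that a pneumonia case treated at home still counts as HomeRec under this rule.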

Note, when training such a network, you should probably stop the
training when your training loss and validation loss first start to diverge
even if further training were to produce a model whose validation
and/or test loss and accuracy were better. This is because you’re
trying to make your predictions best reflect their own uncertainty,
rather than make them as good as possible, but without a good
measure of their uncertainty.
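One possible reading of “first start to diverge” is sketched below: stop at the last epoch before validation loss rises for a few consecutive epochs while training loss keeps falling. The `patience` parameter, the helper name, and the loss values are assumptions made for illustration.

```python
def first_divergence_epoch(train_losses, val_losses, patience=2):
    """Return the index of the last epoch before validation loss rose
    for `patience` consecutive epochs while training loss kept falling,
    or None if that never happens."""
    streak = 0
    for epoch in range(1, len(val_losses)):
        val_up = val_losses[epoch] > val_losses[epoch - 1]
        train_down = train_losses[epoch] < train_losses[epoch - 1]
        if val_up and train_down:
            streak += 1
            if streak >= patience:
                return epoch - patience
        else:
            streak = 0
    return None

# Made-up per-epoch losses.
train = [0.70, 0.55, 0.45, 0.38, 0.32, 0.27]
val = [0.72, 0.58, 0.50, 0.49, 0.51, 0.54]

stop_at = first_divergence_epoch(train, val)
print("keep the checkpoint from epoch index", stop_at)
```

In practice this means saving a checkpoint every epoch and restoring the one at the returned index.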

Here’s a couple more things to consider:

If you want to focus narrowly on predicting the “HomeRec” probability,
then perhaps you should train a pure “HomeRec” vs. everything else
binary classifier. Then the classifier won’t make compromises to
get the other classes right at the expense of not doing as well on
“HomeRec.” But you could argue to the contrary that by not training
on the other classes you’re giving your training procedure less
information. Perhaps learning the other classes helps your model
learn relevant “features” that help it do better on “HomeRec.”

Another point (that I’ll outline in the context of a pure “HomeRec” binary
classifier): If your dataset is unbalanced, it might be best to train your
classifier with weighted sampling or class weights in the loss function
so that its predictions won’t be biased in favor of the class that occurs
more frequently in your training set. But, coming back to the issue of
predicting probabilities, if the fraction of “HomeRec” vs. “Not HomeRec”
in your training set is representative of the “HomeRec” fraction in the
“real-world” data to which you’ll be applying your model, then perhaps
you do want your predictions to be so biased. Consider the extreme
case where your model gives the same prediction value, regardless
of its input: If your real-world and training data both consist of 30%
“HomeRec” samples, then you would want that single prediction value
to be 0.30, rather than the “unbiased” 0.50.
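The extreme case can be worked out numerically. This sketch scans constant predictions on a grid and finds the one that minimizes binary cross-entropy, with and without a positive-class weight. The weight multiplies the positive term, the same role `pos_weight` plays in `torch.nn.BCEWithLogitsLoss`; the helper name and grid are assumptions.

```python
import math

def constant_bce(p, pos_fraction, pos_weight=1.0):
    """Expected binary cross-entropy when every sample gets prediction p
    and a fraction pos_fraction of the samples are positive."""
    return -(
        pos_weight * pos_fraction * math.log(p)
        + (1 - pos_fraction) * math.log(1 - p)
    )

grid = [i / 100 for i in range(1, 100)]
pos_fraction = 0.30  # 30% "HomeRec" samples

# Unweighted loss: the best constant prediction is the base rate.
best_unweighted = min(grid, key=lambda p: constant_bce(p, pos_fraction))

# Weighting positives by (negatives / positives) balances the classes.
balance_weight = (1 - pos_fraction) / pos_fraction
best_weighted = min(
    grid, key=lambda p: constant_bce(p, pos_fraction, balance_weight)
)

print(best_unweighted, best_weighted)
```

So class weighting moves the uninformed prediction from the base rate of 0.30 to the “unbiased” 0.50, which is the trade-off described above.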

Good luck.

K. Frank

shartzog · September 20, 2020, 6:41pm

As always, many, many thanks for your time and advice, @KFrank!

Thank you for the explanation re the above! That makes it much clearer why my ‘likelihoods’ are NOT likelihoods OR probabilities in any real sense and how interpreting them as such could lead to false impressions.

Great tip and explanation!!! I’ll be playing around with learning rates and early stops to see how they influence model behavior. I’m guessing a reduced learning rate will be preferable to aggressive learning with an early stop, but there’s one way to find out…

This will be an interesting test. I’d like to see how my predictions compare. Prior to now, I’ve been thinking about my multi-label model as if it were a set of parallel binary classifiers, but that’s clearly a bad analogy.

I think this might be key in my use case. My dataset is ambiguous to say the least. It’s possible (and in fact, common) for the exact same input to appear multiple times with different target labels. E.g. I could have three 50 year old men with hypertension, one who was hospitalized with pneumonia but recovered, one who recovered at home with no complications, and one who died of covid related cardiac arrest without ever being hospitalized. My input “image” would be the same in all three cases since it’s “calculated” from tabular data, not a true image. I think this is the key difference between what I’m attempting to do and the traditional image recognition use case. There really isn’t a right answer. That’s certainly the reason that I’ve been so hell-bent on interpreting my model results as probabilities.
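When identical inputs carry different targets, one sanity check is to group the rows by input and compute the observed frequency of each label per group. For a fixed input, binary cross-entropy is minimized by predicting exactly that frequency, so these values are what a perfectly fit model would output on the training data. The records below are made up to mirror the three-patient example.

```python
from collections import defaultdict

LABELS = ["Hospitalized", "Intubated", "Deceased", "Pneumonia"]

# Made-up (input features, multi-hot target) records.
records = [
    (("M", 50, "hypertension"), (1, 0, 0, 1)),
    (("M", 50, "hypertension"), (0, 0, 0, 0)),
    (("M", 50, "hypertension"), (0, 0, 1, 0)),
    (("F", 35, "none"), (0, 0, 0, 0)),
]

# Collect all targets that share the same input.
grouped = defaultdict(list)
for features, target in records:
    grouped[features].append(target)

def label_frequencies(targets):
    """Per-label fraction of rows in which the label is active."""
    n = len(targets)
    return [sum(col) / n for col in zip(*targets)]

freqs = {features: label_frequencies(ts) for features, ts in grouped.items()}
for features, f in freqs.items():
    print(features, dict(zip(LABELS, (round(x, 2) for x in f))))
```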

But back to your insight re bias… I think I actually do want that bias reflected in my predictions, but if so, I’d be making some strong assumptions regarding the quality, comprehensiveness, and representativeness of my data, i.e. assumptions that seem best avoided. Either way, I think I’ll do some trial runs with class weights and some with batch normalization to see how they influence my results. At the very least, that’ll be a valuable training exercise.

At the end of the day, I realize this all might be a fool’s errand. I’m not sure I’ll ever be able to make statements like “diabetes increases risk by 10% for women aged 40 to 50” as I originally hoped, but I’ve definitely learned a ton about PyTorch, so it’s been worth it either way…

Thanks again for all of your thoughts and advice!
Cheers,
SEH

KFrank · September 20, 2020, 7:25pm

Hi Sam!

Just one quick comment: On the contrary, I think understanding a
multi-label, multi-class classifier as a set of parallel binary classifiers
is exactly the right way to think about it. The “parallel” classifiers
have an efficiency advantage in training and inference, because
you pass any given sample through one network instead of many,
and much of the work is shared. They can also have an accuracy
advantage, because training on all of the classes at the same time
can help your classifier learn more “insightful” features that end up
working better for all classes. (One could certainly imagine, however,
a ten-class dataset, where the first nine classes cause training on
all ten classes to hurt the performance for the tenth class.)
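The “parallel binary classifiers” view shows up directly in the loss: averaging binary cross-entropy over every (sample, label) entry gives the same number as averaging the per-label binary losses. This is a minimal sketch with made-up predictions and targets.

```python
import math

def bce(p, y):
    """Binary cross-entropy for one prediction p and one 0/1 target y."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

# Made-up per-label probabilities and multi-hot targets for two samples.
probs = [[0.8, 0.1, 0.2, 0.6], [0.3, 0.05, 0.1, 0.2]]
targets = [[1, 0, 0, 1], [0, 0, 0, 0]]
n_samples, n_labels = len(probs), len(probs[0])

# Multi-label loss: mean over every (sample, label) entry.
multilabel_loss = sum(
    bce(p, y)
    for prow, trow in zip(probs, targets)
    for p, y in zip(prow, trow)
) / (n_samples * n_labels)

# The same data viewed as four separate binary problems.
per_label_losses = [
    sum(bce(probs[i][j], targets[i][j]) for i in range(n_samples)) / n_samples
    for j in range(n_labels)
]
mean_of_binary = sum(per_label_losses) / n_labels

print(multilabel_loss, mean_of_binary)
```

The difference from truly separate classifiers lies in the shared network weights, not in the loss.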

Best.

K. Frank

Deeply · September 21, 2020, 6:56pm

I would like to add my two cents, although I didn’t read all the discussion. If the ‘None’ labels are known a priori, then you can predict ‘None’ as a form of Anomaly Detection based on the other classes.

Otherwise, I think it’s a good option to add the ‘None’ label if it is available at training time. By not doing so, what you are doing is the following:
Designed labels
[1, 0, 0, 0]; [0, 1, 0, 0]; [0, 0, 1, 0]; [0, 0, 0, 1]

Then, the prediction of None is made via the ad-hoc label [0, 0, 0, 0]. IMHO, having five classes (after including None), the model should output five logits. This is what the machine learning literature has found to work best.

Finally, it is a good idea to pay special attention to the data balance, as it is expected that samples with ‘None’ (presumably healthy subjects) will dominate such datasets.

shartzog · September 21, 2020, 8:19pm

I’ll have to do some research about Anomaly Detection and how it works. Sounds like it could be useful for a lot of things! Thanks for the tip, @Deeply!