Hi Andrew!
Your network does what your loss function trains it to do.
As a general rule, if you train your network to do better on one thing,
then – all else being equal – it will likely do worse on something
else.
(That doesn’t mean you can’t train your network to do better on
everything, perhaps by training longer, or using a more apt loss
function, or using a better optimization algorithm, or training with
more or better data, etc.)
My point is that by using class weights that favor class-0, you’re
telling the training that you care less about getting class-1 and
class-2 right. So, in general, your network won’t perform as well on
class-1 and class-2, and that includes mislabelling them as class-0,
because that’s what you trained it to do.
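
To make the class-weight point concrete, here is a minimal sketch in plain Python (not tied to any framework, and not your actual training code). The helper `loss_share_per_class` and the weights `[3.0, 1.0, 1.0]` are made up for illustration. It takes three samples the network is equally unsure about and shows how much of the total weighted cross-entropy each class accounts for.

```python
import math

def softmax(logits):
    # Subtract the max for numerical stability before exponentiating.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def loss_share_per_class(batch_logits, targets, class_weights):
    # Hypothetical helper: fraction of the total weighted cross-entropy
    # that comes from samples of each true class.
    per_class = [0.0] * len(class_weights)
    for logits, y in zip(batch_logits, targets):
        p = softmax(logits)
        per_class[y] += class_weights[y] * -math.log(p[y])
    total = sum(per_class)
    return [c / total for c in per_class]

# One sample per class; the network is equally unsure about all of them.
batch_logits = [[0.0, 0.0, 0.0]] * 3
targets = [0, 1, 2]

uniform = loss_share_per_class(batch_logits, targets, [1.0, 1.0, 1.0])
favor_0 = loss_share_per_class(batch_logits, targets, [3.0, 1.0, 1.0])

print("equal weights:   ", uniform)
print("favoring class-0:", favor_0)
```

With equal weights each class accounts for a third of the loss. With the class-0 weight tripled, class-0 accounts for about 60% of it, so the optimizer has less incentive to fix mistakes on class-1 and class-2.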
Now, if you don’t really care about mixing up classes 1 and 2, but
want to get class-0 right from both a false negative and false positive
perspective (that is, you’re willing, e.g., to mislabel class-1 as class-2,
but you don’t want to mislabel class-1 as class-0), you could, at the
extreme, train a binary classifier that identifies class-0 vs. everything
else. (Again, a trade-off: do better on both class-0 false positives and
false negatives, at the cost of not distinguishing class-1 from class-2.)
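
As a rough illustration of what the binary view counts as an error, here is a small sketch in plain Python. The helper names and the example labels are invented for this post. It collapses three-class labels to "class-0 vs. everything else" and counts class-0 false positives and false negatives.

```python
def is_class0(labels):
    # Collapse three-class labels: 1 means "class-0", 0 means "anything else".
    return [1 if y == 0 else 0 for y in labels]

def class0_errors(true_labels, pred_labels):
    # Hypothetical helper: count class-0 false positives and false negatives.
    t = is_class0(true_labels)
    p = is_class0(pred_labels)
    false_pos = sum(1 for a, b in zip(t, p) if a == 0 and b == 1)
    false_neg = sum(1 for a, b in zip(t, p) if a == 1 and b == 0)
    return false_pos, false_neg

# Made-up labels and predictions, purely for illustration.
true_labels = [0, 0, 1, 1, 2, 2]
pred_labels = [0, 1, 2, 0, 1, 2]

three_class_errors = sum(1 for a, b in zip(true_labels, pred_labels) if a != b)
false_pos, false_neg = class0_errors(true_labels, pred_labels)

print("three-class errors:", three_class_errors)
print("class-0 false positives:", false_pos)
print("class-0 false negatives:", false_neg)
```

The two class-1 / class-2 mix-ups are errors in the three-class view but are invisible in the binary view, which is exactly the trade-off described above.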
Now, some speculation, because I’ve never actually tried this. You
could keep your conventional three-class loss function (class-0 vs.
class-1 vs. class-2), which does distinguish between class-1 and
class-2, and add to it a two-class loss function (class-0 vs.
other-than-class-0). This should help train your network to do better
on both class-0 false negatives and false positives, while still
distinguishing class-1 from class-2, just not as well as it would
with the three-class loss alone.
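
Here is one way the combined loss could be sketched for a single sample, in plain Python. This is untested as a training recipe and rests on assumptions of mine: the two-class probabilities are derived from the three-class softmax (P(class-0) vs. P(class-1) + P(class-2)), and the function name `combined_loss` and the mixing factor `aux_weight` are hypothetical.

```python
import math

def softmax(logits):
    # Subtract the max for numerical stability before exponentiating.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def combined_loss(logits, target, aux_weight=0.5):
    # Three-class cross-entropy plus a weighted two-class term
    # (class-0 vs. other-than-class-0). aux_weight is an assumed knob.
    p = softmax(logits)
    three_class = -math.log(p[target])
    p_zero = p[0]
    p_other = sum(p[1:])  # P(class-1) + P(class-2)
    two_class = -math.log(p_zero if target == 0 else p_other)
    return three_class + aux_weight * two_class

# The network strongly predicts class-2 for this sample.
logits = [0.0, 0.0, 5.0]

loss_true_1 = combined_loss(logits, target=1)  # class-1 mislabelled as class-2
loss_true_0 = combined_loss(logits, target=0)  # class-0 mislabelled as class-2

print("true class-1:", loss_true_1)
print("true class-0:", loss_true_0)
```

Both samples are wrong by the same amount in the three-class sense, but the class-0 sample is penalized more because the two-class term also fires. The class-1 sample only pays the three-class penalty, plus a small two-class term, since the network did correctly say "not class-0".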
This will also bias your network to do better on class-0, but in a
way that’s different from overweighting class-0 in your conventional
three-class loss function. You’d still be making a trade-off, just a
different one. Which way to go depends on the details of what’s
more important to you.
Best.
K. Frank