Imbalanced classification, accuracy tops out even with more information

Yes, you have the order right. I trained the “stage0” model to predict the worst-performing, but also most prevalent, class (it is probably the worst performing because of the class weights used in crossentropyloss). After that, I fed just the softmax output for that class (not the “all others” output) into the stage1 network.

As for the scaling: my “stage1” (or “base”/“original”) network has several different inputs that don’t all come in at the same level. So when I added this new feature I tried 3 different models, putting it in 1) at the bottom of the “feature extraction block” (a couple of dense layers with ReLUs and dropouts), 2) at the top of the “feature extraction block”, so it was basically an additional feature, and 3) at the bottom of the “classification block” (a couple of dense layers with ReLUs and dropouts at the top of the network). There are also batchnorms in between those sections. I feel like the network made immediate use of this new feature, as the initial training epochs started off with a substantially lower loss, but it then trended down to the same loss I was achieving before.

I describe the data here (Per-class and per-sample weighting). However, I’ll summarize again:

The nature of the problem is to classify every segment of a time series recording into 1 of X classes. I have thousands of recordings, and each recording is approximately 1000 segments long (but varies considerably - which I handle with padding). My inputs are the raw data for the recording for each segment (about 1x6000), a transform of this (7x200), and a second transform (1x100). (I have tried eliminating one or more of these inputs, but the performance suffers.) It is known that the segments cannot be classified in isolation, so context is necessary. How much context is unknown, but probably more than 10 segments on either side. Because I don’t want to deal with tuning the context size, I instead just train on the entire recording (as the batch) at once. I am using a relatively new idea (temporal convolutional networks) instead of LSTMs to deal with the time aspect, and they seem to be working really well.
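
To make the padding point concrete, here is a minimal sketch in plain Python of padding variable-length label sequences and keeping a mask so that padded segments do not count toward a metric. The function names (`pad_recordings`, `masked_accuracy`) and the pad value of -1 are hypothetical, and using a mask at all is an assumption about how the padding might be handled, not a description of the actual pipeline.

```python
def pad_recordings(recordings, pad_value=-1):
    """Pad each per-segment label list to the longest recording.

    Returns the padded lists plus a boolean mask that is True for real
    segments and False for padding. pad_value=-1 is an arbitrary choice.
    """
    max_len = max(len(r) for r in recordings)
    padded, mask = [], []
    for r in recordings:
        n_pad = max_len - len(r)
        padded.append(list(r) + [pad_value] * n_pad)
        mask.append([True] * len(r) + [False] * n_pad)
    return padded, mask


def masked_accuracy(preds, labels, mask):
    """Accuracy over real segments only; padded positions are skipped."""
    correct = total = 0
    for p_row, l_row, m_row in zip(preds, labels, mask):
        for p, l, m in zip(p_row, l_row, m_row):
            if m:
                total += 1
                correct += int(p == l)
    return correct / total


# Two toy "recordings" of different lengths (per-segment class labels).
labels, mask = pad_recordings([[0, 1, 2], [1]])
preds = [[0, 1, 1], [1, 0, 0]]  # predictions at padded positions are ignored
acc = masked_accuracy(preds, labels, mask)
```

The same mask would typically be applied to the loss as well, so that padded segments do not affect the gradients.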

Let me know if you have further questions about the nature of the problem.

Work since last post
So, I was thinking that crossentropyloss may not be the best target function for this problem (even though it is “the” loss for multi-class classification). So I wrote my own, which makes use of the confidences in the predictions and mirrors the actual accuracy function (Cohen’s kappa) that I’m using (but inverted, so that the loss decreases as the “accuracy” increases). This eked out some additional gains, but at the expense of the network completely ignoring the smallest class.
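
A minimal sketch of what a confidence-based kappa loss could look like, written in plain Python for readability. It is one plausible construction, not necessarily the loss described above: the predicted probabilities are accumulated into a "soft" confusion matrix, Cohen's kappa is computed from it, and the loss is `1 - kappa`. The function names and the `eps` guard are assumptions; in a real training loop the same arithmetic would be done with tensor operations so that gradients flow through it.

```python
def soft_confusion(probs, labels, num_classes):
    """Soft confusion matrix: row = true class, column = predicted class.

    Each sample adds its full probability vector to the row of its true
    label, so confident and unconfident predictions are weighted differently.
    """
    conf = [[0.0] * num_classes for _ in range(num_classes)]
    for p, t in zip(probs, labels):
        for c in range(num_classes):
            conf[t][c] += p[c]
    return conf


def kappa_from_confusion(conf, eps=1e-12):
    """Cohen's kappa = (p_o - p_e) / (1 - p_e) from a confusion matrix."""
    k = len(conf)
    n = sum(sum(row) for row in conf)
    p_o = sum(conf[i][i] for i in range(k)) / n            # observed agreement
    row = [sum(conf[i]) / n for i in range(k)]             # true-class marginals
    col = [sum(conf[i][j] for i in range(k)) / n for j in range(k)]  # predicted marginals
    p_e = sum(r * c for r, c in zip(row, col))             # chance agreement
    return (p_o - p_e) / (1.0 - p_e + eps)                 # eps avoids division by zero


def soft_kappa_loss(probs, labels, num_classes):
    """Inverted kappa, so the loss decreases as agreement increases."""
    return 1.0 - kappa_from_confusion(soft_confusion(probs, labels, num_classes))


labels = [0, 1, 0, 1]
perfect = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
uniform = [[0.5, 0.5]] * 4
loss_perfect = soft_kappa_loss(perfect, labels, 2)  # close to 0
loss_uniform = soft_kappa_loss(uniform, labels, 2)  # close to 1 (kappa of 0)
```

Because a single kappa is computed over the whole matrix, a rare class contributes very little to both the observed and the chance agreement, which is consistent with a network being able to ignore it at little cost.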

My next attempt is to modify the loss in the following way: instead of using one kappa for the entire confusion matrix, calculate each individual kappa (one for each class-vs-other) and take the product of them. The thinking is that if any one kappa suffers, the overall loss will grow, and the loss will decrease only if all of the kappas are doing well.
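
The per-class idea can be sketched as follows, again in plain Python and under my own assumptions about the details: each class is collapsed to a 2x2 class-vs-other table, a binary kappa is computed for each, and the loss is one minus their product. The function names are hypothetical, and the confusion matrix here holds hard counts for simplicity; a soft matrix built from predicted probabilities would go through the same arithmetic.

```python
def binary_kappa(tp, fp, fn, tn, eps=1e-12):
    """Cohen's kappa for a 2x2 class-vs-other table."""
    n = tp + fp + fn + tn
    p_o = (tp + tn) / n
    # Chance agreement: P(true pos) * P(pred pos) + P(true neg) * P(pred neg).
    p_e = ((tp + fn) * (tp + fp) + (fp + tn) * (fn + tn)) / (n * n)
    return (p_o - p_e) / (1.0 - p_e + eps)


def per_class_kappas(conf):
    """One kappa per class from a KxK matrix (row = true, column = predicted)."""
    k = len(conf)
    n = sum(sum(row) for row in conf)
    kappas = []
    for c in range(k):
        tp = conf[c][c]
        fn = sum(conf[c]) - tp
        fp = sum(conf[i][c] for i in range(k)) - tp
        tn = n - tp - fn - fp
        kappas.append(binary_kappa(tp, fp, fn, tn))
    return kappas


def kappa_product_loss(conf):
    """1 - product of the per-class kappas."""
    prod = 1.0
    for kap in per_class_kappas(conf):
        prod *= kap
    return 1.0 - prod


# Toy case: the smallest class (index 2) is never predicted.
ignored = [[90, 0, 0],
           [0, 8, 0],
           [2, 0, 0]]
kappas = per_class_kappas(ignored)      # kappa for class 2 is 0
loss_ignored = kappa_product_loss(ignored)  # 1.0: the product collapses to 0

balanced = [[5, 0],
            [0, 5]]
loss_balanced = kappa_product_loss(balanced)  # close to 0
```

One caveat with a plain product: kappa can be negative, so two negative per-class kappas would multiply to a positive value. Clamping each kappa at zero before multiplying is one possible guard, though that is an addition on my part and not something described above.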

I’m still not 100% convinced of this method. But it seems like imbalanced multi-class classification is still just a tough problem (especially when you can’t over- or undersample).