Imbalanced classification: accuracy tops out even with more information

I apologize if this is not pytorch-specific enough.

I have a fairly imbalanced data set for a time series classification problem (several classes, with the smallest class ~1/8 as common as the largest, and the largest making up about 50% of the total). Unfortunately, due to the continuous nature of the data, and my desire to use the entire time series of each sample as the input (contextual information is expected to be very important for classification), I cannot oversample or undersample at all.

Therefore I set the class weights to the inverse proportion of the class prevalence in the training set (i.e., largest weight 8, smallest weight 1). I will say that I am currently still adding capacity, because I can’t seem to really overfit the data yet (although huge models just underfit).
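For concreteness, this is roughly how I set it up (the counts below are illustrative, not my real ones):

import torch
import torch.nn as nn

# Illustrative class counts: the most common class is ~8x the rarest,
# so the rarest class gets weight 8 and the most common gets weight 1.
class_counts = torch.tensor([4000., 2000., 1000., 500.])
class_weights = class_counts.max() / class_counts     # -> [1., 2., 4., 8.]
criterion = nn.CrossEntropyLoss(weight=class_weights)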

I noticed that the most common class was getting a lower accuracy than the other classes. So, as a let’s-try-it-and-see test, I created a separate model to classify just that one class vs. the others, which achieved ~80% accuracy. I then fed this prediction into the first model as a single feature for each chunk of time that has to be classified. Obviously, the model should then have a lot more information about how to classify that particular class. However, the loss stops at pretty much exactly the same value, with basically the same confusion matrix.

I’m flummoxed as to where to go from here. The class weights themselves seem to be limiting me, but when I’ve tried training unweighted, every class except the most common one takes too much of a hit. I’ve thought about building separate models for each class, but even if each individual model does well, there still has to be a “decider” model that takes their outputs, which will again have to deal with the imbalance.

Your idea of using “specialist models” sounds interesting, and I would suggest digging a bit deeper into this approach.
As far as I understand, you trained a separate model to classify the worst performing class against all the others (the stage0 model).
After this is done, you feed this prediction into your base model (the stage1 model) and try to classify all samples.
How are you feeding the prediction of your stage0 model into stage1? It might, for example, be a scaling issue: this particularly useful feature will be difficult to learn if your other features live in a completely different range and mask the prediction. Could you check this and rescale the features if necessary?
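Something like this quick check would show it (stage0_pred and features are placeholders for your actual tensors):

# Compare the scale of the stage0 prediction to the other inputs.
print(stage0_pred.mean().item(), stage0_pred.std().item())
print(features.mean().item(), features.std().item())

# If the ranges differ by orders of magnitude, standardize before feeding it in:
stage0_pred = (stage0_pred - stage0_pred.mean()) / (stage0_pred.std() + 1e-8)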

I’m not sure I understand the limitations of your dataset correctly. Could you provide some sample data with random values, e.g.:

import torch

nb_samples = 100
nb_features = 10
seq_len = 45
nb_classes = 5
# assuming one label per time step; adjust the shapes to your real layout
data = torch.randn(nb_samples, seq_len, nb_features)
target = torch.empty(nb_samples, seq_len, dtype=torch.long).random_(nb_classes)

I would like to take a look and check whether a weighted sampling approach really isn’t possible.
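For completeness, this is the kind of weighted sampling I have in mind: sampling whole recordings (so the time series stay intact), weighted by how much of the rare classes they contain. A rough sketch, assuming a dataset object and the target tensor from above:

from torch.utils.data import DataLoader, WeightedRandomSampler

class_count = torch.bincount(target.view(-1), minlength=nb_classes).float()
class_weight = class_count.sum() / class_count      # rare classes -> large weights
sample_weight = class_weight[target].mean(dim=1)    # one weight per recording
sampler = WeightedRandomSampler(sample_weight, num_samples=len(sample_weight))
loader = DataLoader(dataset, sampler=sampler, batch_size=16)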

Yes, you are correct about the order of what I did: train the “stage0” model to predict the worst performing, but also most prevalent, class (probably worst performing due to the class weights used in CrossEntropyLoss). After that, I fed just the softmax output for that class (not the “all others” output) into the stage1 network.

As for the scaling: my “stage1” (or “base”/“original”) network has several different inputs that don’t all come in at the same level. So when I added this new feature, I tried three different variants, where I put it in 1) at the bottom of the “feature extraction block” (a couple of dense layers with ReLUs and dropout), 2) at the top of the “feature extraction block”, so it was basically an additional feature, and 3) at the bottom of the “classification block” (a couple of dense layers with ReLUs and dropout at the top of the network). There are also batchnorms in between those sections. And I feel like the network made immediate use of this new feature, as the initial training epochs started off with a substantially lower loss, but it then trended down to the same loss I was achieving before.
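Concretely, variant 2 boiled down to something like this (simplified; the names are placeholders):

# features: (batch, channels, nb_segments); stage0_prob: (batch, nb_segments) in [0, 1]
x = torch.cat([features, stage0_prob.unsqueeze(1)], dim=1)   # one extra channel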

I describe the data here (Per-class and per-sample weighting). However, I’ll summarize again:

The nature of the problem is to classify every segment of a time series recording into 1 of X classes. I have thousands of recordings, and each recording is approximately 1000 segments long (but varies considerably, which I handle with padding). My inputs are the raw data of the recording for each segment (about 1x6000), a transform of this (7x200), and a second transform (1x100). (I have tried eliminating one or more of these inputs, but the performance suffers.) It is known that it is not possible to classify the segments in isolation, so context is necessary. How much context is unknown, but probably more than 10 segments on either side. Because I don’t want to deal with tuning the context size, I instead just train on the entire recording (for the batch) at the same time. I am using a relatively new idea, temporal convolutional networks, instead of LSTMs to deal with the time aspect, and they seem to be working really well.
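In case it helps, the core building block is just a dilated 1-D convolution along the segment axis. A minimal sketch of the idea (not my exact architecture), with symmetric padding since context from both sides matters here:

import torch.nn as nn

class DilatedBlock(nn.Module):
    def __init__(self, in_ch, out_ch, kernel_size=3, dilation=1):
        super().__init__()
        pad = (kernel_size - 1) // 2 * dilation   # symmetric padding keeps the length
        self.conv = nn.Conv1d(in_ch, out_ch, kernel_size, padding=pad, dilation=dilation)
        self.relu = nn.ReLU()

    def forward(self, x):                         # x: (batch, channels, nb_segments)
        return self.relu(self.conv(x))

Stacking these with growing dilations (1, 2, 4, 8, …) widens the receptive field exponentially; four blocks with kernel size 3 already see 15 segments on either side.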

Let me know if you have further questions about the nature of the problem.

Work since last post
So, I was thinking about the fact that CrossEntropyLoss may not be the best objective for this problem (even though it is “the” loss for multi-class classification). So I wrote my own loss that makes use of the confidences in the predictions and mirrors the actual accuracy metric I’m using (Cohen’s kappa), but inverted, so that the loss decreases as the “accuracy” increases. This eked out some additional gains, but at the expense of the network completely ignoring the smallest class.
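The core of it looks roughly like this (simplified; logits and targets are flattened over batch and time first):

import torch
import torch.nn.functional as F

def soft_kappa(prob, target, nb_classes):
    # Cohen's kappa on a "soft" confusion matrix built from the predicted
    # probabilities, so gradient flows through the confidences.
    onehot = F.one_hot(target, nb_classes).float()             # (N, C)
    conf = onehot.t() @ prob                                   # soft confusion matrix (C, C)
    n = conf.sum()
    p_o = conf.diag().sum() / n                                # observed agreement
    p_e = (conf.sum(dim=0) * conf.sum(dim=1)).sum() / n ** 2   # chance agreement
    return (p_o - p_e) / (1.0 - p_e + 1e-8)

def kappa_loss(logits, target, nb_classes):
    # Inverted, so the loss decreases as kappa increases.
    return 1.0 - soft_kappa(F.softmax(logits, dim=-1), target, nb_classes)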

My next attempt is to modify the loss in the following way: instead of using one kappa for the entire confusion matrix, calculate each individual kappa (one for each class-vs-rest) and take the product of them. The thinking is that if any one kappa suffers, the overall loss will grow, and only if all of the kappas are doing well will the loss decrease.
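In code, reusing the soft_kappa helper from above, the idea is roughly:

def product_kappa_loss(logits, target, nb_classes):
    # 1 - product of one-vs-rest kappas: if any one class does badly,
    # its low kappa shrinks the product and keeps the loss high.
    prob = F.softmax(logits, dim=-1)
    total = 1.0
    for k in range(nb_classes):
        prob_k = torch.stack([prob[:, k], 1.0 - prob[:, k]], dim=1)  # class k vs rest
        target_k = (target != k).long()                              # 0 = class k, 1 = rest
        total = total * soft_kappa(prob_k, target_k, 2)
    return 1.0 - total

One wrinkle: individual kappas can go negative early in training, and two negative factors multiply to a positive, so the product only behaves sensibly once all the kappas are positive.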

I’m still not 100% convinced by this method. But it seems like imbalanced multi-class classification is still just a tough problem (especially when you can’t oversample or undersample).