How to handle imbalanced classes

Hello,

I have similar class imbalance problems where my class counts look like this:

{0: 112328,
1: 6321,
2: 169578,
3: 7140,
4: 3704,
5: 6600,
6: 3438,
7: 1155,
8: 1157,
9: 4528}

Here is what I have tried:

  1. Setting the ignore_index argument of the CrossEntropyLoss function to the index of the most dominant class above. The overall accuracy went down and the model somehow started predicting the dominant class more.

  2. Setting weights in the CrossEntropyLoss function. It bumped up the overall accuracy, but the model still predicts the dominant class far better than the other classes (a rough sketch of what I mean is below).
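For reference, a rough sketch of the kind of weighting I mean in point 2 (not my exact code; the normalization is just one option):

import torch
import torch.nn as nn

# class counts as listed above
counts = torch.tensor([112328, 6321, 169578, 7140, 3704,
                       6600, 3438, 1155, 1157, 4528], dtype=torch.float)

# inverse-frequency weights, rescaled so they sum to the number of classes
weights = counts.sum() / counts
weights = weights / weights.sum() * len(counts)

criterion = nn.CrossEntropyLoss(weight=weights)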

Is there any other workaround? Thanks.

You could also try to oversample your minority classes using a WeightedRandomSampler.


Does that come with a DataLoader? Is there a way to use it without the torch.utils.data DataLoader?
I have written my own batch generator:

def getBatch(batch_size, train_data):
    # random.shuffle(train_data)  # optionally shuffle before batching
    sindex = 0
    eindex = batch_size
    while eindex < len(train_data):
        # yield a full batch and advance the window
        yield train_data[sindex:eindex]
        sindex = eindex
        eindex += batch_size

    # yield the final (possibly smaller) batch
    yield train_data[sindex:]
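
I suppose I could also mimic the weighted sampling without a DataLoader by drawing indices with torch.multinomial. An untested sketch, assuming I have one weight per training sample (e.g. the inverse frequency of its class):

import torch

def weighted_batches(batch_size, train_data, sample_weights):
    # sample_weights: one weight per element of train_data
    sample_weights = torch.as_tensor(sample_weights, dtype=torch.float)
    num_batches = len(train_data) // batch_size
    for _ in range(num_batches):
        # draw batch_size indices with probability proportional to the weights
        idx = torch.multinomial(sample_weights, batch_size, replacement=True)
        yield [train_data[i] for i in idx.tolist()]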

Regards

To use the DataLoader, I may have to start considering one word at a time, as you advised earlier, instead of treating a sentence as an image.

Well, I tried using the DataLoader that comes with PyTorch, but I'm not sure about the weights the sampler assigns to the classes; maybe the inner workings of the DataLoader's sampler aren't clear to me:

sequence: tensor([ 8956, 22184, 16504, 148, 727, 14016, 12722, 43, 12532])
targets: tensor([4, 7, 5, 7, 7, 7, 5, 7, 7])

Can you help?

Regards

The weights should be defined as the inverse class frequency for each sample.
Here is a small example. Could you try that or check for differences with your code?
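
A minimal sketch along those lines, using a toy dataset:

import torch
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

# toy data with a heavy imbalance: 990 samples of class 0, 10 samples of class 1
targets = torch.cat([torch.zeros(990, dtype=torch.long), torch.ones(10, dtype=torch.long)])
data = torch.randn(len(targets), 5)

class_counts = torch.bincount(targets).float()
class_weights = 1.0 / class_counts        # inverse class frequency
sample_weights = class_weights[targets]   # one weight per sample

sampler = WeightedRandomSampler(sample_weights, num_samples=len(sample_weights), replacement=True)
loader = DataLoader(TensorDataset(data, targets), batch_size=100, sampler=sampler)

for _, batch_targets in loader:
    print(batch_targets.float().mean())   # should hover around 0.5, i.e. roughly balanced batches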

Hello Peter,

This is a conll2003 corpus and a sentence looks like one below:

EU NNP B-NP B-ORG
rejects VBZ B-VP O
German JJ B-NP B-MISC
call NN I-NP O
to TO B-VP O
boycott VB I-VP O
British JJ B-NP B-MISC
lamb NN I-NP O

All I have to do is extract the word and its NER tag from the rightmost column and make my sentence look like the one below:

(['EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British', 'lamb', '.'] --> one image
['B-ORG\n', 'O\n', 'B-MISC\n', 'O\n', 'O\n', 'O\n', 'B-MISC\n', 'O\n', 'O\n']) --> set of tags/labels for that image

Each unique word and tag has an index like so:

{'B-LOC\n': 4,
'B-MISC\n': 9,
'B-ORG\n': 6,
'B-PER\n': 7,
'I-LOC\n': 8,
'I-MISC\n': 3,
'I-ORG\n': 1,
'I-PER\n': 5,
'O\n': 2,
'PAD': 0}

Here, each word in a sentence is converted to its index, like so:

[ 0, 10397, 21273, 16075, 20124, 15819, 2729, 13298, 14387]

and is embedded through an nn.Embedding layer (so that each word becomes a 50-dimensional vector),

and now a single sentence becomes a 2D image. But instead of having a single label per image, I have a list of labels per image. These labels are the NER tags of each word. The problem is that my dataset has a lot of words of the 'O\n' class, as pointed out in the comment earlier, so my model tends to predict the dominant class (the typical class imbalance problem).
So, I need to balance these classes.
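
Just to illustrate the shapes, roughly (the vocabulary size here is made up):

import torch
import torch.nn as nn

vocab_size = 25000                      # made-up vocabulary size
embedding = nn.Embedding(vocab_size, 50)

sentence = torch.tensor([0, 10397, 21273, 16075, 20124, 15819, 2729, 13298, 14387])
sentence_image = embedding(sentence)    # shape [9, 50]: one 50-dimensional vector per word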

The code to calculate weights:

indexed_counts #frequency of each class

{0: 112328,
1: 3704,
2: 169578,
3: 1155,
4: 7140,
5: 4528,
6: 6321,
7: 6600,
8: 1157,
9: 3438}

tag_weights = {}
for key in indexed_counts.keys():
    tag_weights[key] = 1.0 / indexed_counts[key]  # inverse class frequency
sampler = [i[1] for i in sorted(tag_weights.items())]  # weights ordered by class index
sampler

[8.902499821950003e-06,
0.0002699784017278618,
5.896991355010673e-06,
0.0008658008658008658,
0.00014005602240896358,
0.00022084805653710247,
0.00015820281601012498,
0.00015151515151515152,
0.000864304235090752,
0.00029086678301337986]

(As a side note, I am passing this as my weight tensor to the CrossEntropyLoss function.)

Just so you can compare that tag_weights and sampler are in the same class-index order:
tag_weights

{0: 8.902499821950003e-06, 1: 0.0002699784017278618, 2: 5.896991355010673e-06, 3: 0.0008658008658008658, 4: 0.00014005602240896358, 5: 0.00022084805653710247, 6: 0.00015820281601012498, 7: 0.00015151515151515152, 8: 0.000864304235090752, 9: 0.00029086678301337986}

Coming to your suggestion of splitting by classes: splitting a sentence without maintaining the sequence of words before and after it would mean the context of a word might be lost. The model needs to learn the NER tag of a given word from its context as well (the words appearing before and after it).

This use case seems to be a bit more complicated, since you have multiple class indices for each sentence.
Even if we sample whole sentences, we would still need to calculate a weight for each one.
Since the sentence lengths are most likely different, we can't just multiply the word weights, as this would penalize long sentences.

Assume we have a few sentences and a dictionary of two words with weights 0.1 and 0.01.
Ideally we would like to create batches with approximately equal word class counts.
If the sentences look like this:

s0 = [w0, w1]                 -> [0.1, 0.01]
s1 = [w0, w1, w1, w1, w1, w1] -> [0.1, 0.01, 0.01, 0.01, 0.01, 0.01]
s2 = [w1]                     -> [0.01]

we would need to somehow calculate a valid weight for each sentence.
I haven't tried it yet, but do you think it would work if you multiply the word weights and normalize with the Nth root, where N is the sentence length?
This would yield sentence weights of:

s0 => 0.0316
s1 => 0.0147
s2 => 0.01
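
In code, the idea would look something like this untested sketch (word_weights is the toy two-word dictionary from above):

import torch

word_weights = {'w0': 0.1, 'w1': 0.01}

def sentence_weight(sentence):
    # geometric mean of the word weights: product normalized by the Nth root
    w = torch.tensor([word_weights[word] for word in sentence])
    return w.prod().pow(1.0 / len(sentence)).item()

print(sentence_weight(['w0', 'w1']))                          # ~0.0316
print(sentence_weight(['w0', 'w1', 'w1', 'w1', 'w1', 'w1']))  # ~0.0147
print(sentence_weight(['w1']))                                # 0.01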

Let me know if I'm missing something.

I am using a sentence-windowing technique to fix the length of the sentences my model sees, so variable-length input shouldn't be an issue.

example:

input sentence = ('EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British', 'lamb', '.')

output_windows =

[('', '', '', 'EU', 'rejects', 'German', 'call', 'to', 'boycott'),
('', '', 'EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British'),
('', 'EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British', 'lamb'),
('EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British', 'lamb', '.'),
('rejects', 'German', 'call', 'to', 'boycott', 'British', 'lamb', '.', ''),
('German', 'call', 'to', 'boycott', 'British', 'lamb', '.', '', ''),
('call', 'to', 'boycott', 'British', 'lamb', '.', '', '', ''),
('to', 'boycott', 'British', 'lamb', '.', '', '', '', '')]

I couldn't understand this line. Is there a simpler example to illustrate it? Maybe that will help me understand the application of class weights better.

Regards

Using simple multiplication of word weights, short sentences will get much higher weights than longer ones, since we repeatedly multiply by numbers in [0, 1].

However, if your sentences have a fixed length, try to create sentence weights by multiplying the word weights, and check whether the class frequencies are approximately equal in each batch.
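
A rough, untested sketch of how you could check that, using made-up windows and tags (the sizes and the dominance of tag 2 are just placeholders):

import torch
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

# made-up data: 1000 windows of length 9; tag 2 dominates ~80% of the positions
windows = torch.randint(0, 25000, (1000, 9))
tags = torch.where(torch.rand(1000, 9) < 0.8,
                   torch.full((1000, 9), 2),
                   torch.randint(0, 10, (1000, 9)))

# inverse tag frequency, then one weight per window as the product of its tag weights
tag_counts = torch.bincount(tags.view(-1), minlength=10).float()
tag_weights = 1.0 / tag_counts
window_weights = tag_weights[tags].prod(dim=1)

sampler = WeightedRandomSampler(window_weights, num_samples=len(window_weights), replacement=True)
loader = DataLoader(TensorDataset(windows, tags), batch_size=32, sampler=sampler)

for _, tag_batch in loader:
    # how often each tag index appears in this batch
    print(torch.bincount(tag_batch.view(-1), minlength=10))
    break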


So this means the model will end up weighting the sentences instead of weighting the classes of each word in a sentence?

Yes, exactly. The sentences will be weighted using the word frequencies.
Since you are using a sliding-window approach, you would need to pre-compute the weight of each window from the word frequencies.

And this would also mean that a sentence with a lower weight would be almost neglected during training, while sentences with higher weights would be given more attention?
Wow!

Hi!
I used a WeightedRandomSampler to deal with an imbalanced dataset by using the following approach to assign weights:

weights = 1.0 / torch.tensor(counts, dtype=torch.float)

where counts is a numpy array that stores the number of samples for each class.
But when I run my DataLoader, it still gives a lot of majority-class samples. Why is that?

Please let me know in case you need any further information.
Thanks!

The weights tensor should contain the weight for each sample, not only the inverse class counts, as shown in this example.


I did exactly the same thing as shown in that example. Here’s my code snippet:

_, counts = np.unique(label_list, return_counts=True)
weights = 1.0 / torch.tensor(counts, dtype=torch.float)
sample_weights = weights[label_list]
sampler = WeightedRandomSampler(sample_weights, len(sample_weights), replacement=True)

where label_list contains the label for each sample.
Thereafter, I feed this sampler into my dataloader.

Thanks for the clarification. I clearly misunderstood how you are passing the weights.
In that case it should work.
Could you post a (small) reproducible code snippet so that we could have a look?

Sure!
There are two functions, sampler_ and loader, where the former is called by the latter.

def sampler_(labels):
    _, counts = np.unique(labels, return_counts=True)
    weights = 1.0 / torch.tensor(counts, dtype=torch.float)
    sample_weights = weights[labels]
    sampler = WeightedRandomSampler(sample_weights, len(sample_weights), replacement=True)
    return sampler
def loader(data_dir, transform, train_split=0.75):
    images, labels, _ = parse_data(data_dir)
    dataset = ImageDataset(images, labels, transform)
    dataset_size = len(dataset)
    indices = list(range(dataset_size))
    np.random.shuffle(indices) # shuffle the dataset before splitting into train and val
    split = int(np.floor(train_split * dataset_size))
    train_indices, val_indices = indices[:split], indices[split:]
    train_labels = [labels[x] for x in train_indices]
    val_labels = [labels[x] for x in val_indices]
    train_sampler, val_sampler = sampler_(train_labels), sampler_(val_labels)
    trainloader = DataLoader(dataset, sampler=train_sampler)
    valloader = DataLoader(dataset, sampler=val_sampler)
    return trainloader, valloader
for (feats, labels) in trainloader:
    print(labels)
Output: tensor([5, 5, 5, 5, 6, 5, 5, 6, 8, 5, 6, 5, 5, 5, 6, 5, 5, 6, 5, 5, 5, 5, 5, 5,
        6, 5, 6, 5, 0, 5, 5, 6])
tensor([5, 5, 5, 5, 5, 6, 5, 5, 5, 1, 5, 5, 5, 5, 5, 5, 5, 6, 5, 5, 5, 5, 5, 5,
        5, 5, 5, 5, 6, 5, 5, 5])
tensor([5, 6, 5, 5, 5, 5, 5, 6, 5, 5, 6, 5, 6, 5, 5, 5, 5, 5, 6, 5, 5, 6, 5, 5,
        5, 6, 5, 5, 0, 5, 5, 5])
and so on (where 5 is the majority class).

Please let me know your take on this.
Thanks!

The correspondence between the dataset splits and the sample_weights is broken.
While train_labels and val_labels correspond to the shuffled indices, both samplers will just assign the weights to data indices starting at 0 in sequential order.

The easiest way to fix it would be to wrap dataset in a Subset before passing it to the DataLoader:

trainloader = DataLoader(Subset(dataset, train_indices), sampler=train_sampler, batch_size=10)
valloader = DataLoader(Subset(dataset, val_indices), sampler=val_sampler, batch_size=10)
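
With the Subset wrappers in place, a quick sanity check (just a sketch, reusing trainloader from above) would be to count the labels over one pass:

from collections import Counter

label_counts = Counter()
for _, labels in trainloader:
    label_counts.update(labels.tolist())
print(label_counts)  # the classes should now appear roughly equally often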