How to handle imbalanced classes

kkorovesis · December 17, 2017, 5:23pm

I am trying to balance my data for a multi-class classification task to get better scores. Using weights and torch.utils.data.sampler.WeightedRandomSampler(), I get an error that I don’t understand. Is there any other easy way to handle imbalanced classes? Here is a snippet of my code and the error:

    .
train_set = SentimentDataset(file=TRAIN_DATA, word2idx=word2idx, tword2idx=tword2idx,
                             max_length=0, max_topic_length=0, topic_bs=True)
val_set = SentimentDataset(file=VAL_DATA, word2idx=word2idx, tword2idx=tword2idx,
                           max_length=0, max_topic_length=0, topic_bs=True)

_weights =  torch.FloatTensor(train_set.weights) # train_set.weights : [296, 3381, 12882, 12857, 1016]

_weights = _weights.view(1, 5)
_weights = _weights.double()

sampler = torch.utils.data.sampler.WeightedRandomSampler(_weights, BATCH_SIZE)

loader_train = DataLoader(train_set, batch_size=BATCH_SIZE,
                          shuffle=False, sampler=sampler, num_workers=4)

loader_val = DataLoader(val_set, batch_size=BATCH_SIZE,
                        shuffle=False, sampler=sampler, num_workers=4)

model = RNN(embeddings, num_classes=num_classes, **_hparams)

model.cuda()

criterion = torch.nn.CrossEntropyLoss()
parameters = filter(lambda p: p.requires_grad, model.parameters())
optimizer = torch.optim.Adam(parameters)
    
    # TRAIN

...

    class SentimentDataset(Dataset):

        def __init__(self, file, max_length, max_topic_length, word2idx, tword2idx, topic_bs):
            ...

            self.data = [SocialTokenizer(lowercase=True).tokenize(x) for x in self.data]
            self.topics = [SocialTokenizer(lowercase=True).tokenize(x) for x in self.topics]

            self.label_encoder = preprocessing.LabelEncoder()
            self.label_encoder = self.label_encoder.fit(self.labels)
            self.label_count = Counter(self.labels)

            self.weights = [self.label_count['-2'], self.label_count['-1'],
                            self.label_count['0'], self.label_count['1'],
                            self.label_count['2']]
            ...

        def __getitem__(self, index):

            sample, label, topic = self.data[index], self.labels[index], self.topics[index]


  File "/home/kostas/anaconda3/envs/pytorch_env/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 40, in <listcomp>
    samples = collate_fn([dataset[i] for i in batch_indices])
  File "/home/kostas/Gitlab/SemTest/models/datasets.py", line 108, in __getitem__
    sample, label, topic = self.data[index], self.labels[index], self.topics[index]
TypeError: list indices must be integers or slices, not torch.LongTensor

ptrblck · December 17, 2017, 8:04pm

What kind of error do you get?

Here is a sample code, which should work fine:

import numpy as np
import torch
from torch.utils.data import DataLoader, WeightedRandomSampler

numDataPoints = 1000
data_dim = 5
bs = 100

# Create dummy data with class imbalance 9 to 1
data = torch.randn(numDataPoints, data_dim)
target = np.hstack((np.zeros(int(numDataPoints * 0.9), dtype=np.int32),
                    np.ones(int(numDataPoints * 0.1), dtype=np.int32)))

print('target train 0/1: {}/{}'.format(
    len(np.where(target == 0)[0]), len(np.where(target == 1)[0])))

class_sample_count = np.array(
    [len(np.where(target == t)[0]) for t in np.unique(target)])
weight = 1. / class_sample_count
samples_weight = np.array([weight[t] for t in target])

samples_weight = torch.from_numpy(samples_weight)
samples_weigth = samples_weight.double()
sampler = WeightedRandomSampler(samples_weight, len(samples_weight))

target = torch.from_numpy(target).long()
train_dataset = torch.utils.data.TensorDataset(data, target)

train_loader = DataLoader(
    train_dataset, batch_size=bs, num_workers=1, sampler=sampler)

for i, (data, target) in enumerate(train_loader):
    print("batch index {}, 0/1: {}/{}".format(
        i,
        len(np.where(target.numpy() == 0)[0]),
        len(np.where(target.numpy() == 1)[0])))

Chahrazad · December 18, 2017, 5:59am

What is the interpretation of weight here?

weight = 1. / class_sample_count

Shouldn’t the weight be the class frequency?

weight = numDataPoints / class_sample_count
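
One way to look at this question: the two definitions differ only by the constant factor numDataPoints, and the WeightedRandomSampler documentation states that the weights do not need to sum to one, so only their ratios should matter. A minimal sketch, assuming the 900/100 split from the example above (the helper `normalize` is hypothetical and only there for illustration):

```python
num_data_points = 1000
class_sample_count = [900, 100]  # counts of class 0 and class 1 in the dummy data


def normalize(weights):
    """Scale weights so they sum to 1, i.e. turn them into probabilities."""
    total = sum(weights)
    return [w / total for w in weights]


weight_inverse_count = [1.0 / c for c in class_sample_count]
weight_inverse_freq = [num_data_points / c for c in class_sample_count]

# Total probability mass each class receives when every sample carries
# the weight of its class (per-class weight times number of samples).
mass_a = normalize([w * c for w, c in zip(weight_inverse_count, class_sample_count)])
mass_b = normalize([w * c for w, c in zip(weight_inverse_freq, class_sample_count)])

print(mass_a, mass_b)
```

Both variants give each class roughly half of the probability mass, so under this reading the choice between them is a matter of taste.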

kkorovesis · December 18, 2017, 3:17pm

I am guessing the problem is that each item in my train_set consists of 6 data fields and 1 target, instead of 1 data field and 1 target. In your example you have only (data, target). So even if I fixed the weights, they wouldn’t be applied to the correct data. The comment below shows what my train_set contains. I need all 6 inputs in my model, so I can’t change that. Is there a way around it?

train_set = SentimentDataset(file=TRAIN_DATA, word2idx=word2idx, tword2idx=tword2idx,
                             max_length=0, max_topic_length=0, topic_bs=True)

 # train_set: message, topic, len(self.data[index]), len(self.topics[index]), self.weights, index, label

_weights = [0.8,0.8,0.3,0.4,0.8]

sampler = torch.utils.data.sampler.WeightedRandomSampler(_weights, BATCH_SIZE)
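
A note on this snippet: the sampler only produces dataset indices, so the number of fields returned by `__getitem__` should not matter. What it expects is a 1-D sequence with one weight per sample (not one per class), and `num_samples` is the number of indices drawn per epoch rather than the batch size. Passing a weight tensor of shape (1, 5) is a plausible cause of the earlier TypeError, though that is not confirmed in this thread. A minimal sketch, where `ToyDataset` is a hypothetical stand-in for SentimentDataset:

```python
from collections import Counter

import torch
from torch.utils.data import DataLoader, Dataset, WeightedRandomSampler


class ToyDataset(Dataset):
    """Hypothetical stand-in for SentimentDataset: several inputs plus a label."""

    def __init__(self, labels):
        self.labels = labels  # string labels such as '-2' ... '2'
        self.classes = sorted(set(labels))

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, index):
        message = torch.zeros(4)  # placeholder for the encoded message
        topic = torch.zeros(2)    # placeholder for the encoded topic
        label = self.classes.index(self.labels[index])
        return message, topic, index, label


labels = ['0'] * 60 + ['1'] * 30 + ['-2'] * 10
dataset = ToyDataset(labels)

# One weight per sample: the inverse count of that sample's class.
label_count = Counter(labels)
sample_weights = torch.tensor([1.0 / label_count[lab] for lab in labels],
                              dtype=torch.double)

# Draw as many indices per epoch as there are samples, with replacement.
sampler = WeightedRandomSampler(sample_weights, num_samples=len(dataset),
                                replacement=True)
loader = DataLoader(dataset, batch_size=20, sampler=sampler)

for message, topic, index, label in loader:
    # Per-batch class counts; they should be roughly equal across classes.
    print(torch.bincount(label, minlength=len(dataset.classes)).tolist())
```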

jusjusjus · March 4, 2018, 11:05pm

Thanks for the example, super helpful! Here’s some syntactic sugar (or whatever you’d want to call it):

Suggestion:

class_sample_count = np.unique(target, return_counts=True)[1]

Suggestion:

samples_weight = weight[target]

Casting to double is not needed anymore, and probably wasn’t three months ago considering the typo.
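
A quick check that the two suggested one-liners give the same result as the original list comprehensions, assuming the same 9-to-1 dummy `target` as in the example above:

```python
import numpy as np

target = np.hstack((np.zeros(900, dtype=np.int32), np.ones(100, dtype=np.int32)))

# Original formulation
count_loop = np.array([len(np.where(target == t)[0]) for t in np.unique(target)])
weight = 1.0 / count_loop
samples_weight_loop = np.array([weight[t] for t in target])

# Suggested formulation
class_sample_count = np.unique(target, return_counts=True)[1]
samples_weight = weight[target]  # integer-array indexing: one lookup per sample

print(np.array_equal(count_loop, class_sample_count))
print(np.array_equal(samples_weight_loop, samples_weight))
```

Note that `weight[target]` relies on the class labels being the integers 0..K-1, so that a label can be used directly as an index into `weight`.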

ptrblck · March 5, 2018, 6:08am

Thanks for the suggestions!
It’s always nice to clean up the code a bit.

Regarding the last correction, I think the cast was necessary in an older version despite the typo.
I was referring to this post. Good to know it’s apparently not needed anymore.

surojit_sengupta · November 24, 2018, 5:05am

Hello,

I have similar class imbalance problems where my class counts look like this:

{0: 112328,
1: 6321,
2: 169578,
3: 7140,
4: 3704,
5: 6600,
6: 3438,
7: 1155,
8: 1157,
9: 4528}

Here is what I tried:

Setting the ignore_index flag of the CrossEntropyLoss function to the index of the most dominant class above. The overall accuracy went down and the model somehow started predicting the dominant class more.
Setting weights in the CrossEntropyLoss function. This bumped up the overall accuracy, but the model still predicts the dominant class far better than the other classes.

Is there any other workaround? Thanks.
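
For reference, a minimal sketch of the second attempt (class weights in the loss), using the counts listed above. The scaling `total / (num_classes * count)` is one common inverse-frequency choice and an assumption here, not necessarily what was used in this post. Also note that `ignore_index` simply excludes targets with that value from the loss, so it is not really a balancing tool.

```python
import torch
import torch.nn as nn

class_counts = {0: 112328, 1: 6321, 2: 169578, 3: 7140, 4: 3704,
                5: 6600, 6: 3438, 7: 1155, 8: 1157, 9: 4528}

counts = torch.tensor([class_counts[k] for k in sorted(class_counts)],
                      dtype=torch.float)
# Inverse-frequency weights, scaled so that a perfectly balanced dataset
# would give a weight of 1 for every class.
class_weights = counts.sum() / (len(counts) * counts)

criterion = nn.CrossEntropyLoss(weight=class_weights)

# Dummy logits and targets, only to show the call.
logits = torch.randn(8, len(counts))
targets = torch.randint(0, len(counts), (8,))
loss = criterion(logits, targets)
print(class_weights, loss.item())
```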

ptrblck · November 24, 2018, 6:08am

You could also try to oversample your minority classes using a WeightedRandomSampler.

surojit_sengupta · November 24, 2018, 6:14am

Does that come with a dataloader? Is there a way to use it without the torch utils dataloader?
I have written my own batch generator as

def getBatch(batch_size, train_data):
    #random.shuffle(train_data)
    sindex = 0
    eindex = batch_size
    while eindex < len(train_data):
        batch = train_data[sindex: eindex]
        temp = eindex
        eindex = eindex + batch_size
        sindex = temp
        yield batch
    
    if eindex >= len(train_data):
        batch = train_data[sindex:]
        yield batch

Regards
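
A sampler is not tied to the DataLoader: it only yields indices, so the same idea can be applied in a hand-written generator. A rough sketch using only the standard library, with `random.choices` swapped in for `WeightedRandomSampler`; the function name `get_weighted_batches` and the toy data are made up for illustration:

```python
import random
from collections import Counter


def get_weighted_batches(batch_size, train_data, sample_weights, seed=None):
    """Yield batches whose items are drawn with replacement, proportionally
    to sample_weights. One "epoch" draws len(train_data) items in total."""
    rng = random.Random(seed)
    indices = rng.choices(range(len(train_data)), weights=sample_weights,
                          k=len(train_data))
    for start in range(0, len(indices), batch_size):
        yield [train_data[i] for i in indices[start:start + batch_size]]


# Toy data: (features, label) pairs with a 9-to-1 imbalance.
train_data = [([0.0], 0)] * 900 + [([1.0], 1)] * 100
label_count = Counter(label for _, label in train_data)
# One weight per sample: the inverse count of that sample's class.
sample_weights = [1.0 / label_count[label] for _, label in train_data]

for batch in get_weighted_batches(100, train_data, sample_weights, seed=0):
    print(Counter(label for _, label in batch))
```

Because items are drawn with replacement, minority samples appear several times per epoch and some majority samples are not seen at all.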

surojit_sengupta · November 24, 2018, 9:47am

To use the dataloader, I may have to start considering a word at a time as advised by you earlier, instead of going for a sentence as an image.

surojit_sengupta · November 24, 2018, 1:34pm

Well, I tried using the dataloader that comes with PyTorch, and I am not sure what weights the sampler assigns to the classes. Or maybe the inner workings of the dataloader sampler just aren’t clear to me.

sequence: tensor([ 8956, 22184, 16504, 148, 727, 14016, 12722, 43, 12532])
targets: tensor([4, 7, 5, 7, 7, 7, 5, 7, 7])

Can you help?

Regards

ptrblck · November 24, 2018, 8:05pm

The weights should be the inverse class frequency, assigned to each sample.
Here is a small example. Could you try that or check for differences with your code?

surojit_sengupta · November 27, 2018, 6:21am

Hello Peter,

This is the CoNLL-2003 corpus, and a sentence looks like the one below:

EU NNP B-NP B-ORG
rejects VBZ B-VP O
German JJ B-NP B-MISC
call NN I-NP O
to TO B-VP O
boycott VB I-VP O
British JJ B-NP B-MISC
lamb NN I-NP O

All I have to do is extract each word and its NER tag from the rightmost column, and make my sentence look like the one below:

([‘EU’, ‘rejects’, ‘German’, ‘call’, ‘to’, ‘boycott’, ‘British’, ‘lamb’, ‘.’] → One Image
[‘B-ORG\n’, ‘O\n’, ‘B-MISC\n’, ‘O\n’, ‘O\n’, ‘O\n’, ‘B-MISC\n’, ‘O\n’, ‘O\n’]) → set of tags/labels for an image

Each unique word and tag has an index like so:

{‘B-LOC\n’: 4,
‘B-MISC\n’: 9,
‘B-ORG\n’: 6,
‘B-PER\n’: 7,
‘I-LOC\n’: 8,
‘I-MISC\n’: 3,
‘I-ORG\n’: 1,
‘I-PER\n’: 5,
‘O\n’: 2,
‘PAD’: 0}

Here each word in a sentence is converted to its indices like so:

[ 0, 10397, 21273, 16075, 20124, 15819, 2729, 13298, 14387]

Each index is then embedded through an nn.Embedding layer (so that each word is a 50-dimensional vector),

and now a single sentence becomes a 2D image. But instead of having a single label per image, I have a list of labels per image. These labels are the NER tags of each word. The problem is that my dataset has a lot of words of the ‘O\n’ class, as pointed out in my earlier comment, so my model tends to predict the dominant class (a typical class imbalance problem).
So, I need to balance these classes.

The code to calculate weights:

indexed_counts #frequency of each class

{0: 112328,
1: 3704,
2: 169578,
3: 1155,
4: 7140,
5: 4528,
6: 6321,
7: 6600,
8: 1157,
9: 3438}

tag_weights = {}
for key in indexed_counts.keys():
    tag_weights[key] = 1/indexed_counts[key]
sampler = [i[1] for i in sorted(tag_weights.items())]
sampler

[8.902499821950003e-06,
0.0002699784017278618,
5.896991355010673e-06,
0.0008658008658008658,
0.00014005602240896358,
0.00022084805653710247,
0.00015820281601012498,
0.00015151515151515152,
0.000864304235090752,
0.00029086678301337986]

(As a side note, I am passing this as my weight tensor for the cross entropy loss function.)

For comparison, here is tag_weights, showing that it and sampler list the weights in the same class order:
tag_weights

{0: 8.902499821950003e-06, 1: 0.0002699784017278618, 2: 5.896991355010673e-06, 3: 0.0008658008658008658, 4: 0.00014005602240896358, 5: 0.00022084805653710247, 6: 0.00015820281601012498, 7: 0.00015151515151515152, 8: 0.000864304235090752, 9: 0.00029086678301337986}

Coming to your suggestion of splitting by classes: splitting a sentence without maintaining the sequence of words before and after would mean the context of a word might be lost. The model needs to learn the NER tag of a word from its contextual meaning as well (the words appearing before and after it).

ptrblck · November 27, 2018, 12:26pm

This use case seems to be a bit more complicated, since you have multiple class indices for each sentence.
Even if we sample each sentence, we would need to calculate the weight for it.
Since the sentence lengths are most likely different, we can’t just multiply the word weights, as this would penalize long sentences.

Assume we have a few sentences and a dictionary of two words with weights 0.1 and 0.01.
Ideally we would like to create batches with the same number of word counts.
If the sentences look like this:

s0 = [w0, w1]                 -> [0.1, 0.01]
s1 = [w0, w1, w1, w1, w1, w1] -> [0.1, 0.01, 0.01, 0.01, 0.01, 0.01]
s2 = [w1]                     -> [0.01]

then we would need to somehow calculate a valid weight for each sentence.
I haven’t tried it yet, but do you think it would work if you multiply the word weights and normalize the product with the Nth root, where N is the sentence length?
This would yield sentence weights of:

s0 => 0.0316
s1 => 0.0147
s2 => 0.01
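
The Nth root of the product is the geometric mean of the word weights. A small sketch that reproduces the three values above; the function name `sentence_weight` is made up, and the computation goes through logarithms (an implementation choice, not something from the post) so that long sentences do not underflow:

```python
import math


def sentence_weight(word_weights):
    """Geometric mean of the word weights: the N-th root of their product,
    computed via logs to avoid underflow for long sentences."""
    n = len(word_weights)
    return math.exp(sum(math.log(w) for w in word_weights) / n)


w0, w1 = 0.1, 0.01
s0 = [w0, w1]
s1 = [w0, w1, w1, w1, w1, w1]
s2 = [w1]

for name, s in (("s0", s0), ("s1", s1), ("s2", s2)):
    print(name, round(sentence_weight(s), 4))
# s0 0.0316
# s1 0.0147
# s2 0.01
```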

Let me know, if I’m missing something.

surojit_sengupta · November 27, 2018, 12:32pm

I am using a sentence windowing technique to fix the length of the sentences my model sees, so variable-length input shouldn’t be an issue.

example:

input sentence = (‘EU’, ‘rejects’, ‘German’, ‘call’, ‘to’, ‘boycott’, ‘British’, ‘lamb’, ‘.’)

output_windows =

(‘’, ‘’, ‘’, ‘EU’, ‘rejects’, ‘German’, ‘call’, ‘to’, ‘boycott’),
(‘’, ‘’, ‘EU’, ‘rejects’, ‘German’, ‘call’, ‘to’, ‘boycott’, ‘British’),
(‘’, ‘EU’, ‘rejects’, ‘German’, ‘call’, ‘to’, ‘boycott’, ‘British’, ‘lamb’),
(‘EU’, ‘rejects’, ‘German’, ‘call’, ‘to’, ‘boycott’, ‘British’, ‘lamb’, ‘.’),
(‘rejects’, ‘German’, ‘call’, ‘to’, ‘boycott’, ‘British’, ‘lamb’, ‘.’, ‘’),
(‘German’, ‘call’, ‘to’, ‘boycott’, ‘British’, ‘lamb’, ‘.’, ‘’, ‘’),
(‘call’, ‘to’, ‘boycott’, ‘British’, ‘lamb’, ‘.’, ‘’, ‘’, ‘’),
(‘to’, ‘boycott’, ‘British’, ‘lamb’, ‘.’, ‘’, ‘’, ‘’, ‘’)
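
A sketch of how such windows could be generated. It assumes one window per token, centred on that token, with an odd window size of 9 and empty-string padding; under that assumption it also produces a first window centred on ‘EU’ that does not appear in the list above. The function name `sentence_windows` is hypothetical.

```python
def sentence_windows(tokens, window_size=9, pad=''):
    """Return one fixed-length window per token, centred on that token and
    padded with `pad` where the window runs past the sentence boundary.
    Assumes window_size is odd."""
    half = window_size // 2
    padded = [pad] * half + list(tokens) + [pad] * half
    return [tuple(padded[i:i + window_size]) for i in range(len(tokens))]


sentence = ('EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British', 'lamb', '.')
windows = sentence_windows(sentence)
for w in windows:
    print(w)
```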

surojit_sengupta · November 27, 2018, 12:34pm

I couldn’t understand this line. Is there a simpler example? Maybe that will help me understand the application of class weights better.

Regards

ptrblck · November 27, 2018, 12:39pm

With simple multiplication of word weights, short sentences will get much higher weights than longer ones, since each factor is a number in [0, 1].

However, if your sentences have a fixed length, try to create sentence weights by multiplying the word weights, and see if the class frequencies are approx. equal in each batch.

surojit_sengupta · November 27, 2018, 12:44pm

So this means the model will end up weighing the sentences instead of weighing the classes of each word in a sentence?

ptrblck · November 27, 2018, 1:25pm

Yes, exactly. The sentences will be weighted using the word frequencies.
Since you are using a sliding window approach, you would need to pre-compute the sentence weights using the word frequencies.
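
A rough sketch of that pre-computation, under several assumptions not stated in the thread: each window carries a list of tag indices, the per-tag weights are inverse counts, and the window weight is taken as the geometric mean of its tag weights (for fixed-length windows this ranks windows the same way as the plain product, while avoiding underflow). The toy windows and the resulting tag counts are invented.

```python
import math
import random
from collections import Counter

# Invented toy corpus: each window is a fixed-length list of tag indices.
# Tag 2 plays the role of the dominant 'O' class.
windows = [[2, 2, 2, 2, 2]] * 80 + [[2, 6, 2, 2, 2]] * 15 + [[7, 5, 2, 2, 4]] * 5

tag_counts = Counter(tag for w in windows for tag in w)
tag_weights = {tag: 1.0 / n for tag, n in tag_counts.items()}


def window_weight(tags):
    """Geometric mean of the tag weights in one window."""
    return math.exp(sum(math.log(tag_weights[t]) for t in tags) / len(tags))


# Pre-computed once, before training: one weight per window.
weights = [window_weight(w) for w in windows]

# Draw one epoch of windows with replacement and compare tag frequencies.
rng = random.Random(0)
sampled = rng.choices(windows, weights=weights, k=len(windows))
before = Counter(tag for w in windows for tag in w)
after = Counter(tag for w in sampled for tag in w)
print("before:", dict(before))
print("after: ", dict(after))
```

Since nearly every window contains the dominant tag, sampling whole windows can shift the tag proportions but is unlikely to equalise them completely; comparing the counts before and after sampling shows how far it gets on a given dataset.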

surojit_sengupta · November 27, 2018, 1:31pm

And this would also mean that a sentence with a lower weight would almost be neglected for training and sentences with higher weights would be given more attention?
Wow!