How to handle imbalanced classes

Hi ptrblck, thanks for your kind explanation of WeightedRandomSampler. I am trying to use it to resample an unbalanced dataset, but the DataLoader still outputs mostly the most frequent class without any improvement. My dataset loads a dict-like batch (e.g., data['image']) and it does not include label information. Here’s my snippet for the weights:

  import numpy as np
  import torch

  # submeta['commands'] is a sequence of labels, one per sample index
  # inverse class frequency as the weight for each of the three classes
  weight_list = [1. / len(np.where(submeta['commands'] == i)[0]) for i in range(3)]
  # weight_list = [1, 1, 1]
  submeta['weights'] = torch.from_numpy(np.array([weight_list[i] for i in submeta['commands']]))

How do you measure the imbalance if the DataLoader doesn’t output any label information?

I have a separate variable that stores the labels. There are three classes 0, 1, 2 stored in submeta['commands'] = [0, 1, 0, ..., 2]. So I want to know whether it has to be returned by the DataLoader for the sampler to work, or whether it is fine as long as the weights calculated from the labels share their index with the data. Hope I got myself across, and thanks for your kind reply!

You don’t need to return these labels, since you already created a sample weight for each index.
If weight_list is [1, 1, 1] before you assign it to each sample index, either the weight calculation is wrong or your dataset is perfectly balanced.
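
For reference, a minimal sketch of this setup, assuming the labels live in a separate array (here called commands) that is aligned with the dataset indices:

import numpy as np
import torch
from torch.utils.data import WeightedRandomSampler

# hypothetical label array, index-aligned with the dataset samples
commands = np.array([0, 1, 0, 2, 1, 0])

# inverse class frequency as the weight per class, then one weight per sample
class_count = np.bincount(commands)
samples_weight = torch.from_numpy(1. / class_count[commands]).double()

# the Dataset never has to return the label; the sampler only consumes the weights
sampler = WeightedRandomSampler(samples_weight, num_samples=len(samples_weight), replacement=True)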

Really, thank you ptrblck! A late update: there was a problem in my weight indexing (relative vs. absolute indices). I followed your advice from another similar topic, checked the indices carefully, and bang! Thanks :smile:

As a follow-up question, I wonder how to implement class-balanced sampling when using BCELoss for minibatch updates. The first challenge is the shuffling of classes vs. a fixed weight, but reimplementing the DataLoader can fix that. I am more uncertain about the second challenge: if I do not set drop_last=True, is there a way to keep the class weights correct for the last batch, which might have fewer samples? (The same problem applies to CrossEntropyLoss.)

BCEWithLogitsLoss seems to be the better fit here, because it accepts class weights (via its pos_weight argument) rather than sample weights.
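
For illustration, a minimal sketch of the class-weight route, assuming a binary setup with a 9:1 imbalance; pos_weight rescales the positive class inside the loss instead of resampling:

import torch
import torch.nn as nn

# e.g. 900 negatives vs. 100 positives -> weight the positive class 9x
criterion = nn.BCEWithLogitsLoss(pos_weight=torch.tensor([900. / 100.]))

logits = torch.randn(8, 1)                    # raw model outputs, no sigmoid applied
target = torch.randint(0, 2, (8, 1)).float()
loss = criterion(logits, target)

Since the weight is attached to the loss rather than to the batch composition, a smaller last batch (drop_last=False) keeps the same class weighting.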

I guess you’re wrong!
Defining:

sampler = WeightedRandomSampler(samples_weight, len(samples_weight))
train_dataset = torch.utils.data.TensorDataset(data, target)
train_loader = DataLoader(
    train_dataset, batch_size=bs, num_workers=1, sampler=sampler)

without replacement=True (i.e. with replacement=False), this acts the same as:

train_loader = DataLoader(
    train_dataset, batch_size=bs, num_workers=1, shuffle=True)

and I cannot figure out how you actually handled the unbalanced dataset?!

No, a weighted sampler without replacement will not act like a random sampler which just shuffles.
It will still use the weights to draw the samples and will thus select samples with a higher weight earlier. However, it will not be able to “oversample”, since it cannot re-draw the same sample if replacement=False is used.

I have posted several code snippets in this thread and have linked to other threads which give you minimal, executable code. E.g. you can run this one and will see that each batch is balanced when replacement=True. I don’t know where I claimed otherwise.
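
For context, a minimal sketch along the lines of the referenced snippet (the exact code lives in the linked post; the random data and feature dimension here are assumptions), which prints per-batch class counts like the outputs below:

import torch
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

bs = 100
data = torch.randn(1000, 5)
# 900 samples of class 0 and 100 samples of class 1
target = torch.cat((torch.zeros(900, dtype=torch.long), torch.ones(100, dtype=torch.long)))
print('target train 0/1: {}/{}'.format((target == 0).sum(), (target == 1).sum()))

# inverse class frequency as the per-sample weight
class_sample_count = torch.bincount(target)
samples_weight = (1. / class_sample_count.float())[target]

sampler = WeightedRandomSampler(samples_weight, len(samples_weight), replacement=True)
train_loader = DataLoader(TensorDataset(data, target), batch_size=bs, sampler=sampler)

for i, (x, y) in enumerate(train_loader):
    print('batch index {}, 0/1: {}/{}'.format(i, (y == 0).sum(), (y == 1).sum()))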

I also see that you’ve actually responded to the post sharing the code snippet, so did you actually run it?

(I’m @dieas93 with a different account.)
Here is the output of your code with replacement=False:

target train 0/1: 900/100
batch index 0, 0/1: 56/44
batch index 1, 0/1: 68/32
batch index 2, 0/1: 83/17
batch index 3, 0/1: 95/5
batch index 4, 0/1: 98/2
batch index 5, 0/1: 100/0
batch index 6, 0/1: 100/0
batch index 7, 0/1: 100/0
batch index 8, 0/1: 100/0
batch index 9, 0/1: 100/0

and with replacement=True:

target train 0/1: 900/100
batch index 0, 0/1: 47/53
batch index 1, 0/1: 47/53
batch index 2, 0/1: 53/47
batch index 3, 0/1: 51/49
batch index 4, 0/1: 44/56
batch index 5, 0/1: 59/41
batch index 6, 0/1: 50/50
batch index 7, 0/1: 49/51
batch index 8, 0/1: 45/55
batch index 9, 0/1: 47/53

With no replacement, as you said, the network sees samples with higher weights earlier, but does it really affect accuracy? The model sees all samples anyway, which means samples with lower weights will have more impact on accuracy (as we can see from the outputs).
Even with replacement there is no guarantee that all samples come into play! From a mathematical viewpoint, more epochs are needed to make sure the model sees all samples at least once.
I think that to be 100% sure the data is balanced in each batch, one should define a custom BatchSampler, and of course there are no good docs on how to define one!
Correct me if I’m wrong.

I would guess the order of samples will have an effect on the accuracy, but this approach does not create balanced batches via oversampling, and I haven’t seen results using it to tackle an imbalanced dataset. The standard approach to create balanced batches is to use a weighted sampler with replacement=True, as used in my code snippets.

That is correct and is a shortcoming of the replacement strategy, or generally of oversampling minority classes. However, in practice it can still be useful to counter overfitting to the majority class(es), even if more epochs might be needed.

Sure, manually specifying the (balanced) batch indices is also a valid approach. I don’t know if any advantage should be expected, so please share your results in case you compare the custom sampler approach to the standard WeightedRandomSampler balancing approach.


If I find good docs on how to implement such a sampler I will definitely try that as well :crossed_fingers:. I have a strong feeling that the custom sampler approach yields better results, since you can freely define batches containing samples from all categories with the same frequency and even use all samples in an epoch.
By the way, thanks for the immediate response :+1:

That sounds like a plan and I would be interested in the results to see if the “randomness” of a WeightedRandomSampler would help or if a defined balancing via a custom sampler would yield better results.

I think the best resources are the already implemented samplers from here.
E.g. take a look at the RandomSampler. You would derive your custom sampler from the Sampler base class and implement the __init__, __iter__ and __len__ methods.
In the __init__ you could already create the indices using a custom strategy or just store some arguments (e.g. the generator to seed your code etc.). The __iter__ should yield the indices and the __len__ would return the number of samples.

Also, check BatchSampler which can yield a batch of indices to the Dataset.__getitem__ and might fit your use case better.
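
As a sketch of the BatchSampler route (the class name and the two-argument interface are assumptions), here is a batch sampler that yields index lists with an equal number of samples per class:

import numpy as np
from torch.utils.data import Sampler

class BalancedBatchSampler(Sampler):
    # yields batches of indices containing n_per_class samples of every class
    def __init__(self, targets, n_per_class):
        targets = np.asarray(targets)
        self.class_indices = [np.where(targets == c)[0] for c in np.unique(targets)]
        self.n_per_class = n_per_class
        self.n_batches = min(len(idx) for idx in self.class_indices) // n_per_class

    def __iter__(self):
        # reshuffle each class every epoch, then slice one chunk per class per batch
        shuffled = [np.random.permutation(idx) for idx in self.class_indices]
        for b in range(self.n_batches):
            batch = np.concatenate([idx[b * self.n_per_class:(b + 1) * self.n_per_class]
                                    for idx in shuffled])
            np.random.shuffle(batch)
            yield batch.tolist()

    def __len__(self):
        return self.n_batches

It would be passed via the batch_sampler argument, e.g. DataLoader(dataset, batch_sampler=BalancedBatchSampler(targets, 50)), which replaces batch_size, shuffle, sampler, and drop_last.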


Check this out :

from itertools import cycle, zip_longest
from random import shuffle

from torch.utils.data import Sampler

class BalancedSampler(Sampler):

    def __init__(self, dataset):
        # collect the sample indices per class (item[1] is assumed to be the label)
        class_idxs = {}
        for idx, item in enumerate(dataset):
            if int(item[1]) not in class_idxs:
                class_idxs[int(item[1])] = [idx]
            else:
                class_idxs[int(item[1])] += [idx]

        for key in class_idxs.keys():
            shuffle(class_idxs[key])

        # interleave the classes so consecutive indices cycle through all of them
        self.seq = []
        for i in self.zip_cycle(*class_idxs.values()):
            self.seq += list(i)

    def __iter__(self):
        for i in self.seq:
            yield i

    @staticmethod
    def zip_cycle(*iterables, empty_default=None):
        # cycle through the shorter iterables until the longest one is exhausted
        cycles = [cycle(i) for i in iterables]
        for _ in zip_longest(*iterables):
            yield tuple(next(i, empty_default) for i in cycles)

    def __len__(self):
        return len(self.seq)
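
A usage sketch for the sampler above (train_dataset and bs are placeholders); note that iterating the whole dataset in __init__ loads every sample once, so passing a separate label array would be cheaper if one is available:

from torch.utils.data import DataLoader

train_loader = DataLoader(train_dataset, batch_size=bs, sampler=BalancedSampler(train_dataset))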

@ptrblck does the WeightedRandomSampler select all samples from the given dataset or only a subset of it?
Say I have this target list (1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0):
11 data points, with 1/2 and 1/9 as the per-sample weights for the two classes.
What would be the number of returned samples?
Will it try to select only as many samples as needed to maintain the above ratio?

You are defining the weight for each sample and specify the number of drawn samples yourself via the weights and num_samples arguments of the WeightedRandomSampler.
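
A hedged sketch for the 11-point example above; num_samples controls how many indices are drawn per epoch (here one full pass), and the draws are only balanced in expectation when replacement=True:

import torch
from torch.utils.data import WeightedRandomSampler

target = torch.tensor([1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0])
weights = torch.where(target == 1, torch.tensor(1 / 2.), torch.tensor(1 / 9.))

sampler = WeightedRandomSampler(weights, num_samples=len(weights), replacement=True)
print(list(sampler))    # 11 indices drawn according to the weights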

What if I can’t hold all my data in memory?
Like, I have millions of JSON files which have {“x”: vector, “y”: label} fields and I want to create sample weights. How do I create them without holding everything in memory?

Thanks

Note that you don’t need to load the actual data, only the targets.
If even the target values cannot be loaded at once, you won’t be able to create the weights in one go and would need to process the dataset in chunks.
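
A minimal sketch of that first pass, assuming the hypothetical layout of one JSON file per sample with a “y” label field; only the labels are kept in memory, never the vectors:

import glob
import json
import numpy as np
import torch

files = sorted(glob.glob('data/*.json'))                        # hypothetical path
targets = np.array([json.load(open(f))['y'] for f in files])    # one file at a time

class_count = np.bincount(targets)
samples_weight = torch.from_numpy(1. / class_count[targets]).double()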

Hi

This is how I am calculating the weight per sample for the WeightedRandomSampler:

from tqdm.notebook import tqdm
import glob, json, os
import numpy as np
import torch

def calculate_sample_weights(json_dir, su2id_json):

    json_files = sorted(glob.glob(os.path.join(json_dir, "*.json")))[:10000]
    sub2id = json.load(open(su2id_json, 'r'))

    # collect the integer target for every file
    target = []
    for idx in tqdm(json_files):
        data = json.load(open(idx, 'r'))
        label = data['subject']
        y = int(sub2id[label])
        target.append(y)

    # count the samples per class (45 classes)
    class_sample_count = np.array(
        [len(np.where(target == t)[0]) for t in np.arange(45)])

    # inverse class frequency as the per-sample weight
    weight = 1. / (class_sample_count + 1e-6)
    samples_weight = np.array([weight[t] for t in target])
    samples_weight = torch.from_numpy(samples_weight)
    samples_weight = samples_weight.double()

    return samples_weight

batch index 0

OrderedDict([(0, 6), (1, 11), (2, 7), (3, 9), (4, 8), (5, 7), (6, 5), (7, 3), (8, 5), (9, 6), (10, 6), (11, 8), (12, 7), (13, 7), (14, 5), (15, 4), (16, 4), (17, 7), (18, 2), (19, 6), (20, 4), (21, 5), (22, 8), (23, 8), (24, 7), (25, 3), (26, 4), (27, 9), (28, 4), (29, 2), (30, 5), (31, 3), (32, 3), (33, 3), (34, 5), (35, 1), (36, 7), (37, 4), (38, 8), (39, 4), (40, 7), (41, 8), (42, 8), (43, 7), (44, 6)])

batch index 1

OrderedDict([(0, 12), (1, 4), (2, 5), (3, 3), (4, 6), (5, 5), (6, 4), (7, 5), (8, 7), (9, 5), (10, 6), (11, 8), (12, 7), (13, 6), (14, 3), (15, 4), (16, 6), (17, 4), (18, 5), (19, 7), (20, 6), (21, 5), (22, 4), (23, 4), (24, 3), (25, 5), (26, 9), (27, 9), (28, 7), (29, 1), (30, 5), (31, 3), (32, 4), (33, 9), (34, 6), (35, 5), (36, 8), (37, 11), (38, 6), (39, 6), (40, 7), (41, 6), (42, 3), (43, 5), (44, 7)])

batch index 2

OrderedDict([(0, 4), (1, 5), (2, 3), (3, 5), (4, 4), (5, 6), (6, 7), (7, 4), (8, 4), (9, 7), (10, 8), (11, 2), (12, 4), (13, 3), (14, 7), (15, 6), (16, 6), (17, 4), (18, 7), (19, 5), (20, 6), (21, 5), (22, 4), (23, 4), (24, 2), (25, 6), (26, 10), (27, 8), (28, 5), (29, 7), (30, 8), (31, 10), (32, 2), (33, 7), (34, 10), (35, 8), (36, 5), (37, 5), (38, 7), (39, 6), (40, 6), (41, 5), (42, 10), (43, 6), (44, 3)])

batch index 3

OrderedDict([(0, 8), (1, 6), (2, 4), (3, 6), (4, 3), (5, 5), (6, 3), (7, 9), (8, 4), (9, 7), (10, 5), (11, 8), (12, 5), (13, 6), (14, 6), (15, 3), (16, 5), (17, 5), (18, 5), (19, 5), (20, 6), (21, 8), (22, 4), (23, 11), (24, 8), (25, 5), (26, 6), (27, 1), (28, 2), (29, 4), (30, 3), (31, 5), (32, 6), (33, 4), (34, 6), (35, 7), (36, 8), (37, 10), (38, 9), (39, 6), (40, 12), (41, 4), (42, 3), (43, 4), (44, 6)])

batch index 4

OrderedDict([(0, 3), (1, 7), (2, 4), (3, 5), (4, 3), (5, 3), (6, 8), (7, 9), (8, 6), (9, 4), (10, 8), (11, 8), (12, 2), (13, 4), (14, 7), (15, 4), (16, 5), (17, 6), (18, 1), (19, 7), (20, 10), (21, 10), (22, 5), (23, 4), (24, 7), (25, 6), (26, 3), (27, 4), (28, 8), (29, 7), (30, 11), (31, 5), (32, 8), (33, 5), (34, 4), (35, 7), (36, 8), (37, 3), (38, 6), (39, 2), (40, 5), (41, 5), (42, 7), (43, 5), (44, 7)])

batch index 5

OrderedDict([(0, 5), (1, 3), (2, 7), (3, 3), (4, 4), (5, 12), (6, 7), (7, 2), (8, 6), (9, 8), (10, 2), (11, 7), (12, 11), (13, 6), (14, 5), (15, 2), (16, 5), (17, 7), (18, 6), (19, 5), (20, 5), (21, 8), (22, 1), (23, 7), (24, 6), (25, 5), (26, 6), (27, 6), (28, 8), (30, 5), (31, 9), (32, 5), (33, 4), (34, 4), (35, 8), (36, 6), (37, 5), (38, 4), (39, 9), (40, 4), (41, 5), (42, 8), (43, 9), (44, 6)])

Here I am not getting equal, or even close to equal, samples per class in each batch.
Could you tell me what is wrong in the flow?
In each dict the first element is the class label and the second is the number of samples for that class.

Do I need to pass the sampler to the train, val, and test DataLoaders?