How to handle imbalanced classes

I guess you’re wrong!
Defining:

sampler = WeightedRandomSampler(samples_weight, len(samples_weight))
train_dataset = torch.utils.data.TensorDataset(data, target)
train_loader = DataLoader(
    train_dataset, batch_size=bs, num_workers=1, sampler=sampler)

without replacement=True (i.e. replacement=False), it acts the same as:

train_loader = DataLoader(
    train_dataset, batch_size=bs, num_workers=1, shuffle=True)

and I cannot figure out how you actually handled the unbalanced dataset?!

No, a weighted sampler without replacement will not act as a random sampler which just shuffles.
It will still use the weights to draw the samples and will thus select samples with a higher weight earlier. However, it will not be able to “oversample” since it cannot re-draw the same sample if replacement=False is used.

I have posted several code snippets in this thread and have linked to other threads which give you minimal, executable code. E.g. you can run this one and you will see that each batch will be balanced when replacement=True. I don’t know where I claimed otherwise.

I also see that you’ve actually responded to the post sharing the code snippet, so did you actually run it?

(I’m @dieas93 with a different account.)
Here is the output of your code with replacement=False:

target train 0/1: 900/100
batch index 0, 0/1: 56/44
batch index 1, 0/1: 68/32
batch index 2, 0/1: 83/17
batch index 3, 0/1: 95/5
batch index 4, 0/1: 98/2
batch index 5, 0/1: 100/0
batch index 6, 0/1: 100/0
batch index 7, 0/1: 100/0
batch index 8, 0/1: 100/0
batch index 9, 0/1: 100/0

and with replacement=True:

target train 0/1: 900/100
batch index 0, 0/1: 47/53
batch index 1, 0/1: 47/53
batch index 2, 0/1: 53/47
batch index 3, 0/1: 51/49
batch index 4, 0/1: 44/56
batch index 5, 0/1: 59/41
batch index 6, 0/1: 50/50
batch index 7, 0/1: 49/51
batch index 8, 0/1: 45/55
batch index 9, 0/1: 47/53

With no replacement, as you said, the network sees the samples with higher weight earlier, but does it really affect accuracy? The model sees all samples anyway, which means the samples with lower weight will have more impact on accuracy (as we can see from the outputs).
Even with replacement there is no guarantee that all samples come into play! From a mathematical viewpoint, more epochs are needed to make sure the model sees all samples at least once.
I think to be 100% sure that the data is balanced in each batch one should define a custom BatchSampler, and of course there are no good docs on how to define one!
Correct me if I’m wrong.

I would guess the order of samples will have an effect on the accuracy, but it’s not creating balanced batches via oversampling and I haven’t seen results on using this approach to tackle an imbalanced dataset. The standard approach to create balanced batches is to use a weighted sampler with replacement=True, as used in my code snippets.

That is correct and is a shortcoming of the replacement strategy, or of oversampling minority classes in general. However, in practice it can still be useful to counter overfitting to the majority class(es), even if more epochs might be needed.

Sure, manually specifying the (balanced) batch indices is also a valid approach. I don’t know if any advantage would be expected, so please share your results in case you compare the custom sampler approach to the standard WeightedRandomSampler balancing approach.


If I find good docs on how to implement such a sampler I will definitely try that as well :crossed_fingers:. I have a strong feeling that the custom sampler approach yields better results, since you can freely define batches containing samples from all categories with the same frequency and even use all samples in an epoch.
By the way, thanks for the immediate response :+1:

That sounds like a plan and I would be interested in the results to see if the “randomness” of a WeightedRandomSampler would help or if a defined balancing via a custom sampler would yield better results.

I think the best resources are the already implemented samplers from here.
E.g. take a look at the RandomSampler. You would derive your custom sampler from the Sampler base class and implement the __init__, __iter__ and __len__ methods.
In the __init__ you could already create the indices using a custom strategy or just store some arguments (e.g. the generator to seed your code etc.). The __iter__ should yield the indices and the __len__ would return the number of samples.

Also, check BatchSampler which can yield a batch of indices to the Dataset.__getitem__ and might fit your use case better.
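
For illustration, here is a minimal sketch (my own example, not taken from the linked implementations) of a batch-level sampler that derives from Sampler, implements __init__, __iter__ and __len__, and yields complete balanced batches of indices; the names BalancedBatchSampler, targets, num_classes and samples_per_class are assumptions:

import torch
from torch.utils.data import Sampler

class BalancedBatchSampler(Sampler):
    def __init__(self, targets, num_classes, samples_per_class):
        # targets: 1D tensor of class labels for the whole dataset
        self.class_indices = [torch.where(targets == c)[0] for c in range(num_classes)]
        self.samples_per_class = samples_per_class
        # the number of batches per epoch is limited by the smallest class
        self.num_batches = min(len(idx) for idx in self.class_indices) // samples_per_class

    def __iter__(self):
        # reshuffle the per-class indices at the start of every epoch
        shuffled = [idx[torch.randperm(len(idx))] for idx in self.class_indices]
        for b in range(self.num_batches):
            batch = []
            start = b * self.samples_per_class
            for idx in shuffled:
                batch.extend(idx[start:start + self.samples_per_class].tolist())
            yield batch  # one balanced batch of indices

    def __len__(self):
        return self.num_batches

# hypothetical usage: the batch_sampler defines the complete batches,
# so batch_size/shuffle are not passed to the DataLoader
# loader = DataLoader(train_dataset, batch_sampler=BalancedBatchSampler(target, 2, 50))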


Check this out:

from itertools import cycle, zip_longest
from random import shuffle

from torch.utils.data import Sampler

class BalancedSampler(Sampler):

    def __init__(self, dataset):
        # collect the dataset indices for each class label
        class_idxs = {}
        for idx, item in enumerate(dataset):
            label = int(item[1])
            if label not in class_idxs:
                class_idxs[label] = [idx]
            else:
                class_idxs[label] += [idx]

        # shuffle the indices within each class
        for key in class_idxs.keys():
            shuffle(class_idxs[key])

        # interleave the per-class index lists so that consecutive
        # indices cycle through all classes
        self.seq = []
        for i in self.zip_cycle(*class_idxs.values()):
            self.seq += list(i)

    def __iter__(self):
        for i in self.seq:
            yield i

    @staticmethod
    def zip_cycle(*iterables, empty_default=None):
        # zip the iterables while cycling the shorter ones
        # until the longest one is exhausted
        cycles = [cycle(i) for i in iterables]
        for _ in zip_longest(*iterables):
            yield tuple(next(i, empty_default) for i in cycles)

    def __len__(self):
        return len(self.seq)
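
A hypothetical usage, assuming the train_dataset and bs from the earlier snippets:

sampler = BalancedSampler(train_dataset)
train_loader = DataLoader(train_dataset, batch_size=bs, sampler=sampler, num_workers=1)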

@ptrblck does the WeightedRandomSampler select all samples from the given dataset or a subset of the dataset?
Say I have this target list: (1,1,0,0,0,0,0,0,0,0,0)
11 data points, with 1/2 and 1/9 as the weights for each sample.
What would be the number of returned samples?
Will it try to select only as many samples as needed to maintain the above ratio?

You are defining the weight for each sample and are specifying the number of drawn samples yourself via the weights and num_samples arguments of the WeightedRandomSampler.
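
For your example above, a quick sketch (the target list and the 1/2, 1/9 weights are from your post, everything else is just a minimal illustration):

import torch
from torch.utils.data import WeightedRandomSampler

target = torch.tensor([1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0])
class_weights = torch.tensor([1. / 9, 1. / 2])   # 9 samples of class 0, 2 samples of class 1
samples_weight = class_weights[target]           # weight per sample

# num_samples defines how many indices are drawn per epoch, independently of the class ratio
sampler = WeightedRandomSampler(samples_weight, num_samples=len(samples_weight), replacement=True)
print(list(sampler))   # 11 indices; indices 0 and 1 will be drawn more often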

What if I can’t hold all my data in memory?
Like, I have millions of JSON files with {“x”: vector, “y”: label} fields and I want to create sample weights. How do I create them without holding everything in memory?

thanks

Note that you don’t need to load the actual data, only the targets.
If the target values cannot be loaded at once, you won’t be able to create the weights directly and would need to process the dataset in chunks.
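
A minimal sketch of what streaming only the labels could look like (the “y” field follows the layout you described; the directory path and variable names are assumptions):

import glob, json
from collections import Counter

import torch

targets = []
for path in glob.glob("jsons/*.json"):            # hypothetical directory
    with open(path) as f:
        targets.append(int(json.load(f)["y"]))    # keep only the label, drop the vector

counts = Counter(targets)                          # class frequencies
samples_weight = torch.tensor([1.0 / counts[t] for t in targets], dtype=torch.double)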

Hi,

This is how I am calculating the weight per sample for the WeightedRandomSampler.

import glob
import json
import os

import numpy as np
import torch
from tqdm.notebook import tqdm

def calculate_sample_weights(json_dir, su2id_json):
    # only the labels are read from the json files, not the feature vectors
    json_files = sorted(glob.glob(os.path.join(json_dir, "*.json")))[:10000]
    sub2id = json.load(open(su2id_json, 'r'))

    target = []
    for idx in tqdm(json_files):
        data = json.load(open(idx, 'r'))
        label = data['subject']
        y = int(sub2id[label])
        target.append(y)
    target = np.array(target)

    # count the samples per class (45 classes in this use case)
    class_sample_count = np.array(
        [len(np.where(target == t)[0]) for t in np.arange(45)])

    # inverse class frequency as the per-sample weight
    weight = 1. / (class_sample_count + 1e-6)
    samples_weight = np.array([weight[t] for t in target])
    samples_weight = torch.from_numpy(samples_weight)
    samples_weight = samples_weight.double()

    return samples_weight
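
For context, a hypothetical continuation showing how these weights would then feed the sampler (json_dir, su2id_json, train_dataset and bs are placeholders):

samples_weight = calculate_sample_weights(json_dir, su2id_json)
sampler = torch.utils.data.WeightedRandomSampler(samples_weight, len(samples_weight))
train_loader = DataLoader(train_dataset, batch_size=bs, sampler=sampler, num_workers=1)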

batch index 0

OrderedDict([(0, 6), (1, 11), (2, 7), (3, 9), (4, 8), (5, 7), (6, 5), (7, 3), (8, 5), (9, 6), (10, 6), (11, 8), (12, 7), (13, 7), (14, 5), (15, 4), (16, 4), (17, 7), (18, 2), (19, 6), (20, 4), (21, 5), (22, 8), (23, 8), (24, 7), (25, 3), (26, 4), (27, 9), (28, 4), (29, 2), (30, 5), (31, 3), (32, 3), (33, 3), (34, 5), (35, 1), (36, 7), (37, 4), (38, 8), (39, 4), (40, 7), (41, 8), (42, 8), (43, 7), (44, 6)])

batch index 1

OrderedDict([(0, 12), (1, 4), (2, 5), (3, 3), (4, 6), (5, 5), (6, 4), (7, 5), (8, 7), (9, 5), (10, 6), (11, 8), (12, 7), (13, 6), (14, 3), (15, 4), (16, 6), (17, 4), (18, 5), (19, 7), (20, 6), (21, 5), (22, 4), (23, 4), (24, 3), (25, 5), (26, 9), (27, 9), (28, 7), (29, 1), (30, 5), (31, 3), (32, 4), (33, 9), (34, 6), (35, 5), (36, 8), (37, 11), (38, 6), (39, 6), (40, 7), (41, 6), (42, 3), (43, 5), (44, 7)])

batch index 2

OrderedDict([(0, 4), (1, 5), (2, 3), (3, 5), (4, 4), (5, 6), (6, 7), (7, 4), (8, 4), (9, 7), (10, 8), (11, 2), (12, 4), (13, 3), (14, 7), (15, 6), (16, 6), (17, 4), (18, 7), (19, 5), (20, 6), (21, 5), (22, 4), (23, 4), (24, 2), (25, 6), (26, 10), (27, 8), (28, 5), (29, 7), (30, 8), (31, 10), (32, 2), (33, 7), (34, 10), (35, 8), (36, 5), (37, 5), (38, 7), (39, 6), (40, 6), (41, 5), (42, 10), (43, 6), (44, 3)])

batch index 3

OrderedDict([(0, 8), (1, 6), (2, 4), (3, 6), (4, 3), (5, 5), (6, 3), (7, 9), (8, 4), (9, 7), (10, 5), (11, 8), (12, 5), (13, 6), (14, 6), (15, 3), (16, 5), (17, 5), (18, 5), (19, 5), (20, 6), (21, 8), (22, 4), (23, 11), (24, 8), (25, 5), (26, 6), (27, 1), (28, 2), (29, 4), (30, 3), (31, 5), (32, 6), (33, 4), (34, 6), (35, 7), (36, 8), (37, 10), (38, 9), (39, 6), (40, 12), (41, 4), (42, 3), (43, 4), (44, 6)])

batch index 4

OrderedDict([(0, 3), (1, 7), (2, 4), (3, 5), (4, 3), (5, 3), (6, 8), (7, 9), (8, 6), (9, 4), (10, 8), (11, 8), (12, 2), (13, 4), (14, 7), (15, 4), (16, 5), (17, 6), (18, 1), (19, 7), (20, 10), (21, 10), (22, 5), (23, 4), (24, 7), (25, 6), (26, 3), (27, 4), (28, 8), (29, 7), (30, 11), (31, 5), (32, 8), (33, 5), (34, 4), (35, 7), (36, 8), (37, 3), (38, 6), (39, 2), (40, 5), (41, 5), (42, 7), (43, 5), (44, 7)])

batch index 5

OrderedDict([(0, 5), (1, 3), (2, 7), (3, 3), (4, 4), (5, 12), (6, 7), (7, 2), (8, 6), (9, 8), (10, 2), (11, 7), (12, 11), (13, 6), (14, 5), (15, 2), (16, 5), (17, 7), (18, 6), (19, 5), (20, 5), (21, 8), (22, 1), (23, 7), (24, 6), (25, 5), (26, 6), (27, 6), (28, 8), (30, 5), (31, 9), (32, 5), (33, 4), (34, 4), (35, 8), (36, 6), (37, 5), (38, 4), (39, 9), (40, 4), (41, 5), (42, 8), (43, 9), (44, 6)])

Here I am not getting equal or close-to-equal samples per class in each batch.
Could you tell me what is wrong in the flow?
Here the first element in each dict entry is the class label and the second is the number of samples for that class.

Do I need to pass the sampler to the train, val and test DataLoaders?

I don’t know, but you could use this code as a small example to see how a WeightedRandomSampler is used to create balanced batches for a binary use case.

You should pass the sampler to the DataLoader using the corresponding targets.
I.e. if the sampler used the training targets to calculate its weights, it should be used together with the training dataset in the training DataLoader.
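
A minimal sketch of that setup (train_samples_weight, train_dataset, val_dataset and bs are placeholders):

# the weights were computed from the training targets only
train_sampler = WeightedRandomSampler(train_samples_weight, len(train_samples_weight))

# the sampler goes into the training DataLoader; val/test loaders stay unshuffled
train_loader = DataLoader(train_dataset, batch_size=bs, sampler=train_sampler)
val_loader = DataLoader(val_dataset, batch_size=bs, shuffle=False)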

I am referring to your code, only for multiclass.
For binary it gives approximately equal samples, but for multiclass it is not giving equal samples.
I guess for multiclass it gives more samples for the classes with lower weights.

That’s not the case and you can easily extend the example to a multiclass use case, which still yields balanced examples:

import numpy as np
import torch
from torch.utils.data import DataLoader

numDataPoints = 10000
data_dim = 5
bs = 1000

# Create dummy data with class imbalance:
# class 0 gets 50% of the samples, classes 1-5 get 10% each
data = torch.randn(numDataPoints, data_dim)
target = np.hstack((np.zeros(int(numDataPoints * 0.5), dtype=np.int32),
                    np.ones(int(numDataPoints * 0.1), dtype=np.int32),
                    np.ones(int(numDataPoints * 0.1), dtype=np.int32) * 2,
                    np.ones(int(numDataPoints * 0.1), dtype=np.int32) * 3,
                    np.ones(int(numDataPoints * 0.1), dtype=np.int32) * 4,
                    np.ones(int(numDataPoints * 0.1), dtype=np.int32) * 5))

class_sample_count = np.array(
    [len(np.where(target == t)[0]) for t in np.unique(target)])
print(class_sample_count)
# [5000 1000 1000 1000 1000 1000]

weight = 1. / class_sample_count
print(weight)
# [0.0002 0.001  0.001  0.001  0.001  0.001 ]
samples_weight = np.array([weight[t] for t in target])

samples_weight = torch.from_numpy(samples_weight)
sampler = torch.utils.data.sampler.WeightedRandomSampler(samples_weight, len(samples_weight))

target = torch.from_numpy(target).long()
train_dataset = torch.utils.data.TensorDataset(data, target)

train_loader = DataLoader(
    train_dataset, batch_size=bs, num_workers=1, sampler=sampler)

for i, (data, target) in enumerate(train_loader):
    print("batch index: {}, class count: {}".format(
        i, [len((target == i).nonzero()) for i in range(len(target.unique()))]))
batch index: 0, class count: [170, 170, 159, 176, 144]
batch index: 1, class count: [163, 166, 177, 155, 177]
batch index: 2, class count: [187, 171, 158, 153, 175]
batch index: 3, class count: [157, 153, 188, 162, 187]
batch index: 4, class count: [158, 166, 161, 167, 182]
batch index: 5, class count: [176, 168, 158, 169, 158]
batch index: 6, class count: [160, 159, 159, 169, 182]
batch index: 7, class count: [165, 158, 180, 154, 169]
batch index: 8, class count: [164, 160, 174, 168, 151]
batch index: 9, class count: [157, 194, 157, 169, 174]

Could you check with a smaller batch size, because I can’t fit a larger batch size?
That’s the only difference I can see between your code and mine.

With a larger batch size (8000) I can get approximately equal samples.

OrderedDict([(0, 172), (1, 170), (2, 183), (3, 165), (4, 168), (5, 192), (6, 176), (7, 168), (8, 187), (9, 174), (10, 172), (11, 188), (12, 186), (13, 176), (14, 175), (15, 139), (16, 178), (17, 159), (18, 162), (19, 168), (20, 177), (21, 176), (22, 160), (23, 184), (24, 196), (25, 189), (26, 183), (27, 184), (28, 178), (29, 201), (30, 184), (31, 160), (32, 196), (33, 182), (34, 197), (35, 179), (36, 175), (37, 176), (38, 175), (39, 179), (40, 194), (41, 178), (42, 177), (43, 194), (44, 168)])

Smaller batch sizes will create more noise, since the weighted sampling is a random process.
If you collect the batches and check the stats over the whole epoch, it will still show a balanced usage:

numDataPoints = 1000
data_dim = 5
bs = 5

# Create dummy data with class imbalance:
# class 0 gets 50% of the samples, classes 1-5 get 10% each
data = torch.randn(numDataPoints, data_dim)
target = np.hstack((np.zeros(int(numDataPoints * 0.5), dtype=np.int32),
                    np.ones(int(numDataPoints * 0.1), dtype=np.int32),
                    np.ones(int(numDataPoints * 0.1), dtype=np.int32) * 2,
                    np.ones(int(numDataPoints * 0.1), dtype=np.int32) * 3,
                    np.ones(int(numDataPoints * 0.1), dtype=np.int32) * 4,
                    np.ones(int(numDataPoints * 0.1), dtype=np.int32) * 5))

class_sample_count = np.array(
    [len(np.where(target == t)[0]) for t in np.unique(target)])
print(class_sample_count)
# [500 100 100 100 100 100]

weight = 1. / class_sample_count
print(weight)
# [0.002 0.01 0.01 0.01 0.01 0.01]
samples_weight = np.array([weight[t] for t in target])

samples_weight = torch.from_numpy(samples_weight)
sampler = torch.utils.data.sampler.WeightedRandomSampler(samples_weight, len(samples_weight))

target = torch.from_numpy(target).long()
train_dataset = torch.utils.data.TensorDataset(data, target)

train_loader = DataLoader(
    train_dataset, batch_size=bs, num_workers=1, sampler=sampler)

freqs = np.zeros(len(target.unique()))
for i, (data, t) in enumerate(train_loader):
    f = [len((t == i).nonzero()) for i in range(len(target.unique()))]
    print("batch index: {}, class count: {}".format(i, f))
    freqs += np.array(f)

print(freqs)
# [164. 185. 185. 139. 159. 168.]