How to handle imbalanced classes

I don’t know, but you could use this code as a small example to see how a WeightedRandomSampler is used to create balanced batches for a binary use case.

You should pass the sampler to the DataLoader whose dataset provides the corresponding targets.
I.e., if the sampler used the training targets to calculate its weights, it should be used together with the training dataset in the training DataLoader.
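
For reference, here is a minimal sketch of that pattern for a binary case (the dummy data, the 90/10 split, and the tensor names are just assumptions for illustration): the per-sample weights are computed from the training targets, and the resulting sampler is passed to the training DataLoader.

import torch
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

# hypothetical binary dummy data: 90% class 0, 10% class 1
num_samples = 1000
data = torch.randn(num_samples, 5)
target = torch.cat((torch.zeros(900, dtype=torch.long),
                    torch.ones(100, dtype=torch.long)))

# per-class weights are the inverse class frequencies of the *training* targets
class_sample_count = torch.bincount(target)    # tensor([900, 100])
weight = 1. / class_sample_count.float()       # tensor([0.0011, 0.0100])
samples_weight = weight[target]                # one weight per training sample

sampler = WeightedRandomSampler(samples_weight, num_samples=len(samples_weight))

# the sampler was built from the training targets, so it belongs in the training DataLoader
train_dataset = TensorDataset(data, target)
train_loader = DataLoader(train_dataset, batch_size=100, sampler=sampler)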

I am referring to your code, but for the multiclass case.
For binary it gives approximately equal samples per class, but for multiclass it does not give equal samples.
I guess for multiclass it gives more samples to the classes with smaller weights.

That’s not the case, and you can easily extend the example to a multiclass use case, which still yields balanced batches:

import numpy as np
import torch
from torch.utils.data import DataLoader

numDataPoints = 10000
data_dim = 5
bs = 1000

# Create dummy data with class imbalance: 50% class 0 and 10% each for classes 1-5
data = torch.randn(numDataPoints, data_dim)
target = np.hstack((np.zeros(int(numDataPoints * 0.5), dtype=np.int32),
                    np.ones(int(numDataPoints * 0.1), dtype=np.int32),
                    np.ones(int(numDataPoints * 0.1), dtype=np.int32) * 2,
                    np.ones(int(numDataPoints * 0.1), dtype=np.int32) * 3,
                    np.ones(int(numDataPoints * 0.1), dtype=np.int32) * 4,
                    np.ones(int(numDataPoints * 0.1), dtype=np.int32) * 5))

class_sample_count = np.array(
    [len(np.where(target == t)[0]) for t in np.unique(target)])
print(class_sample_count)
# [5000 1000 1000 1000 1000 1000]

weight = 1. / class_sample_count
print(weight)
# [0.0002 0.001  0.001  0.001  0.001  0.001 ]
samples_weight = np.array([weight[t] for t in target])

samples_weight = torch.from_numpy(samples_weight)
sampler = torch.utils.data.sampler.WeightedRandomSampler(samples_weight, len(samples_weight))

target = torch.from_numpy(target).long()
train_dataset = torch.utils.data.TensorDataset(data, target)

train_loader = DataLoader(
    train_dataset, batch_size=bs, num_workers=1, sampler=sampler)

for batch_idx, (data, target) in enumerate(train_loader):
    print("batch index: {}, class count: {}".format(
        batch_idx, [len((target == c).nonzero()) for c in range(len(target.unique()))]))
batch index: 0, class count: [170, 170, 159, 176, 144]
batch index: 1, class count: [163, 166, 177, 155, 177]
batch index: 2, class count: [187, 171, 158, 153, 175]
batch index: 3, class count: [157, 153, 188, 162, 187]
batch index: 4, class count: [158, 166, 161, 167, 182]
batch index: 5, class count: [176, 168, 158, 169, 158]
batch index: 6, class count: [160, 159, 159, 169, 182]
batch index: 7, class count: [165, 158, 180, 154, 169]
batch index: 8, class count: [164, 160, 174, 168, 151]
batch index: 9, class count: [157, 194, 157, 169, 174]

Could you check with a smaller batch size? I can’t fit a larger batch size;
that’s the only difference I can see between your code and mine.

With a larger batch size (8000) I can get approximately equal samples per class:

OrderedDict([(0, 172), (1, 170), (2, 183), (3, 165), (4, 168), (5, 192), (6, 176), (7, 168), (8, 187), (9, 174), (10, 172), (11, 188), (12, 186), (13, 176), (14, 175), (15, 139), (16, 178), (17, 159), (18, 162), (19, 168), (20, 177), (21, 176), (22, 160), (23, 184), (24, 196), (25, 189), (26, 183), (27, 184), (28, 178), (29, 201), (30, 184), (31, 160), (32, 196), (33, 182), (34, 197), (35, 179), (36, 175), (37, 176), (38, 175), (39, 179), (40, 194), (41, 178), (42, 177), (43, 194), (44, 168)])

Smaller batch sizes will create more noise, since the weighted sampling is a random process.
If you collect the batches and check the stats over the whole epoch, you will still see a balanced class usage:

numDataPoints = 1000
data_dim = 5
bs = 5

# Create dummy data with class imbalance: 50% class 0 and 10% each for classes 1-5
data = torch.randn(numDataPoints, data_dim)
target = np.hstack((np.zeros(int(numDataPoints * 0.5), dtype=np.int32),
                    np.ones(int(numDataPoints * 0.1), dtype=np.int32),
                    np.ones(int(numDataPoints * 0.1), dtype=np.int32) * 2,
                    np.ones(int(numDataPoints * 0.1), dtype=np.int32) * 3,
                    np.ones(int(numDataPoints * 0.1), dtype=np.int32) * 4,
                    np.ones(int(numDataPoints * 0.1), dtype=np.int32) * 5))

class_sample_count = np.array(
    [len(np.where(target == t)[0]) for t in np.unique(target)])
print(class_sample_count)
# [500 100 100 100 100 100]

weight = 1. / class_sample_count
print(weight)
# [0.002 0.01  0.01  0.01  0.01  0.01 ]
samples_weight = np.array([weight[t] for t in target])

samples_weight = torch.from_numpy(samples_weight)
sampler = torch.utils.data.sampler.WeightedRandomSampler(samples_weight, len(samples_weight))

target = torch.from_numpy(target).long()
train_dataset = torch.utils.data.TensorDataset(data, target)

train_loader = DataLoader(
    train_dataset, batch_size=bs, num_workers=1, sampler=sampler)

freqs = np.zeros(len(target.unique()))
for batch_idx, (data, t) in enumerate(train_loader):
    f = [len((t == c).nonzero()) for c in range(len(target.unique()))]
    print("batch index: {}, class count: {}".format(batch_idx, f))
    freqs += np.array(f)

print(freqs)
# [164. 185. 185. 139. 159. 168.]
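
As a quick sanity check you could also normalize the collected counts (this print is not part of the original output); for this 6-class setup each fraction should come out roughly around 1/6:

print(freqs / freqs.sum())
# each entry should be roughly 1/6 ~= 0.167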