How to oversample most classes while leaving one class imbalanced?

BaruchG · August 25, 2020, 4:17pm

I have an imbalanced dataset where the attribute I want to sample by is not the label but another feature of the data. I would like one class to be drawn 50% of the time, with the other five classes splitting the remaining 50%, so a 10% chance of being chosen per class. The per-class counts in the dataset (c) are:

Counter({'-1': 7557, '0': 3958, '2': 1306, '3': 1144, '4': 861, '1': 323})

where -1 is the class that I want to sample with 50% probability. I made a weighted random sampler to give me equal oversampling like this:

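# One weight per class: the inverse of its count
weight = {d: 1. / c[d] for d in c}
# One weight per sample, looked up from the sample's class (stored at index 4)
samples_weight = np.array([weight[str(item[4])] for item in trainDS])
sampler = WeightedRandomSampler(samples_weight, len(samples_weight), replacement=True)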

which seems to give me equal proportions of each one at approximately 20%. I’m having trouble understanding what my weight variable should be in order to get a 50/10/10/10/10/10 split. In the docs it just says that weights should be “a sequence of weights, not necessary summing up to one”, which isn’t super helpful on what it actually should be.
How should I adjust it to get that split?

BaruchG · August 25, 2020, 5:36pm

If I do:
weight['-1'] *= 5
that class ends up at approximately 50%, but I’m still not sure how to work out the scaling factor analytically.
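
One way to see where the 5 comes from, as a rough sketch: once the weights are balanced (1 / class count), every class carries the same total mass, so scaling one class by k makes it come up with probability k / (k + n - 1), where n is the number of classes. Solving for k gives the factor. The helper name `scale_factor` is made up for illustration, and it assumes the other classes keep a multiplier of 1.

```python
def scale_factor(p: float, n_classes: int) -> float:
    """Multiplier for one class's balanced weight so it is drawn with probability p.

    Assumes the starting weights are the balanced ones (1 / class count) and
    that the remaining n_classes - 1 classes keep a multiplier of 1.
    """
    # With balanced weights each class has total mass 1, so the favoured class
    # is drawn with probability k / (k + n_classes - 1). Solve that for k.
    return p * (n_classes - 1) / (1.0 - p)


k = scale_factor(0.5, 6)
print(k)  # 5.0
```

For six classes and a 50% target this gives 5, which matches what was found by trial and error.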

ayalaa2 · August 25, 2020, 6:03pm

I would just assume that your weights should sum up to 1.0; that’ll just make things easier. It’s true that the weighted sampler doesn’t require this, but I’m guessing it just converts the weights to probabilities behind the scenes anyway.

I would set your weight to this: torch.tensor([0.5, 0.1, 0.1, 0.1, 0.1, 0.1]). This should work just fine. Something like torch.tensor([50, 10, 10, 10, 10, 10]) is probably the same behind the scenes.

BaruchG · August 25, 2020, 6:11pm

Ok, I’ll try that out. I assumed that it somehow was not using probabilities but what you are saying makes a lot of sense, essentially a sort of softmax.

BaruchG · August 26, 2020, 3:34pm

Setting weights to be weight = {'-1': 0.5, '0': 0.1, '3': 0.1, '1': 0.1, '2': 0.1, '4': 0.1} did not work. The first class gets to be around 80% that way.
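
A possible explanation for the 80%, sketched under the assumption that each dictionary value was used directly as the weight of every sample in that class: the sampler weights individual samples, so a class is drawn in proportion to its weight times its count. The helper `expected_share` is hypothetical and only does that arithmetic on the counts from the first post.

```python
from collections import Counter

# Class counts from the first post.
c = Counter({'-1': 7557, '0': 3958, '2': 1306, '3': 1144, '4': 861, '1': 323})
# Per-class values used directly as per-sample weights (the setup that failed).
weight = {'-1': 0.5, '0': 0.1, '3': 0.1, '1': 0.1, '2': 0.1, '4': 0.1}


def expected_share(weight, counts):
    """Expected fraction of draws per class when every sample of a class
    carries that class's weight."""
    mass = {k: weight[k] * counts[k] for k in counts}  # total weight per class
    total = sum(mass.values())
    return {k: m / total for k, m in mass.items()}


share = expected_share(weight, c)
for k in sorted(share):
    print(k, round(share[k], 3))
```

The share for '-1' comes out at roughly 0.83, in line with the observed "around 80%".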

ayalaa2 · August 26, 2020, 8:21pm

Are you sure? This is the test I’m performing:


from torch.utils.data import WeightedRandomSampler

n = 100000
w = [0.5, 0.1, 0.1, 0.1, 0.1, 0.1]
l = list(WeightedRandomSampler(w, n))

for i in range(6):
    print(f'{i} - {100.0 * l.count(i) / n}%')

This is my output:

0 - 50.038%
1 - 9.967%
2 - 9.922%
3 - 10.028%
4 - 9.9%
5 - 10.145%

It might be the way you’re computing the weights from that dictionary you have.

BaruchG · August 26, 2020, 8:42pm

Hmm, yeah, I’ll take a look at how I’m loading it and see if I can figure out where it’s going wrong.

BaruchG · August 27, 2020, 3:09pm

I modified some code from @ptrblck at https://discuss.pytorch.org/t/how-to-handle-imbalanced-classes/11264 and when I change the weights to [.3, .7], just as an example, it does not work, unless I’m misunderstanding you. Here’s what I did:

import numpy as np
import torch
from torch.utils.data import DataLoader, WeightedRandomSampler

numDataPoints = 1000
data_dim = 5
bs = 100

# Create dummy data with class imbalance 9 to 1
data = torch.FloatTensor(numDataPoints, data_dim)
target = np.hstack((np.zeros(int(numDataPoints * 0.9), dtype=np.int32),
                    np.ones(int(numDataPoints * 0.1), dtype=np.int32)))

print('target train 0/1: {}/{}'.format(len(np.where(target == 0)[0]), len(np.where(target == 1)[0])))

class_sample_count = np.array(
    [len(np.where(target == t)[0]) for t in np.unique(target)])
weight = 1. / class_sample_count
weight[0] = .3
weight[1] = .7
print(weight)

samples_weight = np.array([weight[t] for t in target])

samples_weight = torch.from_numpy(samples_weight).double()
sampler = WeightedRandomSampler(samples_weight, len(samples_weight))

target = torch.from_numpy(target).long()
train_dataset = torch.utils.data.TensorDataset(data, target)

train_loader = DataLoader(
    train_dataset, batch_size=bs, num_workers=1, sampler=sampler)

for i, (data, target) in enumerate(train_loader):
  print("batch index {}, 0/1: {}/{}".format(i, len(np.where(target.numpy() == 0)[0]), len(np.where(target.numpy() == 1)[0])))

and the output was:

target train 0/1: 900/100
[0.3 0.7]
batch index 0, 0/1: 82/18
batch index 1, 0/1: 72/28
batch index 2, 0/1: 71/29
batch index 3, 0/1: 74/26
batch index 4, 0/1: 82/18
batch index 5, 0/1: 74/26
batch index 6, 0/1: 81/19
batch index 7, 0/1: 83/17
batch index 8, 0/1: 82/18
batch index 9, 0/1: 78/22

Are you sure your method will work if the dataset is imbalanced to start off with?
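
The batch counts above look consistent with the sampler treating the weights as per-sample rather than per-class. Writing $w_k$ for the weight given to each sample of class $k$ and $n_k$ for the number of samples in that class (symbols introduced here, not taken from the docs), the expected class share would be:

```latex
P(\text{class } k) = \frac{w_k\, n_k}{\sum_j w_j\, n_j},
\qquad
P(\text{class } 0) = \frac{0.3 \cdot 900}{0.3 \cdot 900 + 0.7 \cdot 100}
                   = \frac{270}{340} \approx 0.79
```

That is close to the roughly 72 to 83 zeros per batch of 100 printed above. The earlier six-element test gave 50/10/10/10/10/10 because it had exactly one sample per class.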

BaruchG · August 27, 2020, 3:29pm

Ok, I think that I figured it out with some inspiration from @ayalaa2. If I take the balanced WeightedRandomSampler weights (1 / class count) and then multiply them by whatever proportions I want, it works fine. I don’t know if there is a way to do it without “balancing” them first. For example:

import numpy as np
import torch
from torch.utils.data import DataLoader, WeightedRandomSampler

numDataPoints = 1000
data_dim = 5
bs = 100

# Create dummy data with class imbalance 9 to 1
data = torch.FloatTensor(numDataPoints, data_dim)
target = np.hstack((np.zeros(int(numDataPoints * 0.9), dtype=np.int32),
                    np.ones(int(numDataPoints * 0.1), dtype=np.int32)))

print('target train 0/1: {}/{}'.format(len(np.where(target == 0)[0]), len(np.where(target == 1)[0])))

class_sample_count = np.array(
    [len(np.where(target == t)[0]) for t in np.unique(target)])
weight = 1. / class_sample_count
weight = weight * [.3,.7]
print(weight)

samples_weight = np.array([weight[t] for t in target])

samples_weight = torch.from_numpy(samples_weight).double()
sampler = WeightedRandomSampler(samples_weight, len(samples_weight))

target = torch.from_numpy(target).long()
train_dataset = torch.utils.data.TensorDataset(data, target)

train_loader = DataLoader(
    train_dataset, batch_size=bs, num_workers=1, sampler=sampler)

for i, (data, target) in enumerate(train_loader):
  print("batch index {}, 0/1: {}/{}".format(i, len(np.where(target.numpy() == 0)[0]), len(np.where(target.numpy() == 1)[0])))

This outputs:

target train 0/1: 900/100
[0.00033333 0.007     ]
batch index 0, 0/1: 29/71
batch index 1, 0/1: 31/69
batch index 2, 0/1: 20/80
batch index 3, 0/1: 39/61
batch index 4, 0/1: 34/66
batch index 5, 0/1: 34/66
batch index 6, 0/1: 33/67
batch index 7, 0/1: 31/69
batch index 8, 0/1: 34/66
batch index 9, 0/1: 34/66

I tried multiplying the desired class weight by 5 in my case (and implicitly multiplying the other classes by 1) and it seems to have the effect of making that class get chosen with a 50% probability.
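
Putting the thread together for the original six-class case, here is a minimal sketch of the "balance, then multiply by the target proportion" recipe. It uses only the standard library: `random.choices` stands in for the sampler so the proportions can be checked without a dataset, and the helper name `per_sample_weights` and the toy `labels` list are assumptions made for the example.

```python
import random
from collections import Counter


def per_sample_weights(labels, target_probs):
    """Weight each sample by target_probs[label] / count[label], so every
    class's weights add up to its target probability."""
    counts = Counter(labels)
    return [target_probs[y] / counts[y] for y in labels]


# Toy label list rebuilt from the class counts in the first post.
counts = {'-1': 7557, '0': 3958, '2': 1306, '3': 1144, '4': 861, '1': 323}
labels = [k for k, n in counts.items() for _ in range(n)]
target_probs = {'-1': 0.5, '0': 0.1, '1': 0.1, '2': 0.1, '3': 0.1, '4': 0.1}

samples_weight = per_sample_weights(labels, target_probs)

# Stand-in for the sampler: draw indices with replacement, proportional to weight.
random.seed(0)
drawn = random.choices(range(len(labels)), weights=samples_weight, k=50_000)
freq = Counter(labels[i] for i in drawn)
for k in sorted(freq):
    print(k, round(freq[k] / len(drawn), 3))
```

The same `samples_weight` list should be usable as `WeightedRandomSampler(samples_weight, len(samples_weight), replacement=True)`. Multiplying only the '-1' class by 5 gives the same sampling distribution, since the weights differ only by a constant factor.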