DataLoader - using SubsetRandomSampler and WeightedRandomSampler at the same time

Hi @ptrblck, I took the max batch size and checked. The dataset size is 7956, and a batch size of 7956 (i.e. one batch covering the whole dataset) gave the following per-class counts. They are quite close to the mean; the max deviation is still about 10 percent, but I think that is expected due to the randomness.
0 [799, 818, 814, 820, 731, 791, 743, 837, 817, 786]

The class distribution looks reasonable, so I think you are right.

Perhaps much later than when the question was originally posted, but I would suggest simply using zero weights for the validation samples. Since a SubsetRandomSampler can be conceptually thought of as a WeightedRandomSampler with zero weights outside the subset and equal weights within it (i.e. an indicator function of the subset), you can simply do:

import numpy as np
import torch
from torch.utils.data import WeightedRandomSampler

# toy targets: 999 samples of class 0 and a single sample of class 1
target = torch.cat((torch.zeros(999, dtype=torch.long),
                    torch.ones(1, dtype=torch.long)))
np.random.seed(1337)
indices = list(range(1000))
np.random.shuffle(indices)
train_indices, valid_indices = indices[20:], indices[:20]

# inverse-frequency class weights
class_sample_count = torch.tensor(
    [(target == t).sum() for t in torch.unique(target, sorted=True)])
class_weights = 1 / class_sample_count.float()
sample_weights = class_weights[target]
# zero out the validation indices so they can never be drawn
sample_weights[valid_indices] = 0.0

train_sampler = WeightedRandomSampler(sample_weights, len(sample_weights))
assert len(set(valid_indices).intersection(set(train_sampler))) == 0
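
For completeness, a minimal sketch of plugging this sampler into a DataLoader (the TensorDataset here is just a stand-in for a real dataset):

from torch.utils.data import DataLoader, TensorDataset

# stand-in dataset: the features are irrelevant here, only the targets matter
dataset = TensorDataset(torch.randn(1000, 3), target)
loader = DataLoader(dataset, batch_size=64, sampler=train_sampler)

# batches contain training indices only; if the single class-1 sample landed
# in the training split, they will also be roughly balanced between classes
x, y = next(iter(loader))
print(y.sum(), (y == 0).sum())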

I find this solution more natural, so I decided to share it for anyone else finding this thread.

I am training on an imbalanced dataset of 100 classes with counts ranging from 10 to 10000. I am using a weighted sampler to resolve the imbalance, but I am still getting very low accuracy on some classes with high counts. Is it due to overfitting or something else?

Please let me know. Thanks in advance!

Your model might still be overfitting to the majority classes, so you could increase the weights for the minority samples and rerun the training.
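
A minimal sketch of boosting specific minority classes further, reusing the class_weights tensor from the earlier snippet (the boost factor and class indices are made up):

boost = 2.0                    # hypothetical factor, tune per experiment
minority_classes = [1]         # hypothetical minority class indices
for c in minority_classes:
    class_weights[c] *= boost  # these classes will be drawn even more often
# recompute per-sample weights (re-apply any validation masking afterwards)
sample_weights = class_weights[target]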

For example, I have 3 classes (a, b, and c) and I want to balance their weights.

import numpy as np
import torch
from torch.utils.data import WeightedRandomSampler

def sampler_(labels):
    # inverse-frequency weight per class, mapped to a weight per sample
    classes, counts = np.unique(labels, return_counts=True)
    weights = 1.0 / torch.tensor(counts, dtype=torch.float)
    sample_weights = weights[[list(classes).index(x) for x in labels]]
    return WeightedRandomSampler(sample_weights, len(sample_weights),
                                 replacement=True)

train_labels = ['a', 'b', 'a', 'a', 'b', 'c', 'c', 'c']
sampler = sampler_(train_labels)

I think I am doing something wrong. Please help me with the correct code.
Thanks!

The code snippet looks correct, but if your DataLoader is not returning approx. balanced batches, you could refer to this example and compare the implementations.
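
To check the balance directly, a small sketch, assuming a DataLoader (called loader here, hypothetical) built with your sampler and integer targets:

import collections

counter = collections.Counter()
for _, labels in loader:            # hypothetical loader using the sampler
    counter.update(labels.tolist())
print(counter)  # per-class draw counts should be roughly equal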

How much should I increase or decrease?
Please help me with some sample code.

Thanks

If you’ve verified that your weighted sampling works properly (i.e. the batches are balanced) and your model is still overfitting to a specific class (or underfitting to others), you could run several experiments with different weights and check the confusion matrix to select the appropriate sampling for your use case.
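
For the confusion-matrix check, a self-contained sketch (the random predictions are just a stand-in for the outputs of your trained model):

import torch

def confusion_matrix(preds, targets, num_classes):
    # rows are true classes, columns are predicted classes
    cm = torch.zeros(num_classes, num_classes, dtype=torch.long)
    for t, p in zip(targets.tolist(), preds.tolist()):
        cm[t, p] += 1
    return cm

num_classes = 3
targets = torch.randint(0, num_classes, (100,))
preds = torch.randint(0, num_classes, (100,))  # stand-in model outputs
print(confusion_matrix(preds, targets, num_classes))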

Hi @ptrblck ,

My dataset is imbalanced and I want to use WeightedRandomSampler to overcome this.
Should the sampler be added only to the train loader or should it be added to the val and test loaders as well?

Note: Data for all the classes are available in the train set, but only some of those classes appear in the val and test sets.

Thank you!

I would naively use the original distributions in the validation and test splits.
The test set should be used once your training is finished and would give you the model performance on “unseen, real world samples”, while the validation set would be a proxy for the test set during training/validation. If you resample these sets, you would lose the information about your model’s final performance on new, unseen samples (assuming the distribution stays constant).

Hi @ptrblck ,

Thanks for the answer!
I adapted your code to my requirements, but I am not getting the expected results.
Code:

import numpy as np
import torch
from torch.utils.data import DataLoader, WeightedRandomSampler

targets = np.array([data[2] for data in train_set])
print("Train set answers:", targets)
print("Length of train set answers:", len(targets))
print()
print("Answer vocab:")
print(VQADataset.ans_vocab_to_int)
print()
all_ans_values = np.array(list(VQADataset.ans_vocab_to_int.values()))
print("All the answer classes available in the dataset:")
print(all_ans_values)
print()

class_sample_count = np.array(
    [np.count_nonzero(targets == ans) for ans in np.unique(targets)])

print("The count for each class above respectively:")
print(class_sample_count)
print()

weight = 1. / class_sample_count
print("Weight for each class:")
print(weight)
print()

samples_weight = np.array([weight[target] for target in targets])
print("Weight for each target:")
print(samples_weight)

samples_weight = torch.from_numpy(samples_weight)
samples_weight = samples_weight.double()
sampler = WeightedRandomSampler(samples_weight, len(samples_weight))

train_loader = DataLoader(
    train_set, batch_size=23, num_workers=1, sampler=sampler)

for i, (question, image, target) in enumerate(train_loader):
    # per-class counts for this batch, printed as "0/1/.../22: c0/c1/.../c22"
    header = "/".join(str(c) for c in range(23))
    counts = "/".join(str(np.count_nonzero(target.numpy() == c))
                      for c in range(23))
    print("batch index {}, {}: {}".format(i, header, counts))

Please don’t mind the ugly print statement for the dataloader loop :slight_smile:

Output:

Train set answers: [ 9 10 11  0  3  4 12  5  0  5  0  0 13  0 14 15  0  1 16  1  1  1  1  0
 17  1  1  0 18  2  6  1  7  0 19  8 20  0  1 21  0  4  2  1  4  8 22  1
  6  0]
Length of train set answers: 50

Answer vocab:
{'yes': 0, 'no': 1, '1': 2, 'white': 3, 'skiing': 4, 'frisbee': 5, 'blue': 6, 'green': 7, 'gray': 8, 'net': 9, 'pitcher': 10, 'orange': 11, 'red': 12, 'contrail': 13, 'white and purple': 14, 'brushing teeth': 15, 'frowning': 16, 'black and white': 17, 'skateboard': 18, 'motorcycle': 19, '2': 20, 'purse': 21, 'skis': 22}

All the answer classes available in the dataset:
[ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22]

The count for each class above respectively:
[12 11  2  1  3  2  2  1  2  1  1  1  1  1  1  1  1  1  1  1  1  1  1]

Weight for each class:
[0.08333333 0.09090909 0.5        1.         0.33333333 0.5
 0.5        1.         0.5        1.         1.         1.
 1.         1.         1.         1.         1.         1.
 1.         1.         1.         1.         1.        ]

Weight for each target:
[1.         1.         1.         0.08333333 1.         0.33333333
 1.         0.5        0.08333333 0.5        0.08333333 0.08333333
 1.         0.08333333 1.         1.         0.08333333 0.09090909
 1.         0.09090909 0.09090909 0.09090909 0.09090909 0.08333333
 1.         0.09090909 0.09090909 0.08333333 1.         0.5
 0.5        0.09090909 1.         0.08333333 1.         0.5
 1.         0.08333333 0.09090909 1.         0.08333333 0.33333333
 0.5        0.09090909 0.33333333 0.5        1.         0.09090909
 0.5        0.08333333]
batch index 0, 0/1/2/3/4/5/6/7/8/9/10/11/12/13/14/15/16/17/18/19/20/21/22: 0/3/1/1/1/2/2/1/0/1/0/0/3/0/4/0/0/1/1/0/1/1/0
batch index 1, 0/1/2/3/4/5/6/7/8/9/10/11/12/13/14/15/16/17/18/19/20/21/22: 1/2/1/2/1/1/0/0/2/0/1/0/3/1/1/5/0/0/1/0/0/0/1
batch index 2, 0/1/2/3/4/5/6/7/8/9/10/11/12/13/14/15/16/17/18/19/20/21/22: 1/0/0/0/1/0/0/1/0/0/1/0/0/0/0/0/0/0/0/0/0/0/0

The question is: there are initially 12 samples of class 0 in the train set, but after sampling there are only 2 across the entire DataLoader. Is this the expected behaviour?

Yes, the reduction in the number of samples from the majority class and the increase for the minority classes are expected if the same total number of samples is drawn. You would “replace” majority-class samples by oversampling the minority classes.

Just for the sake of debugging: could you artificially increase the number of inputs (just repeat the samples) to see if the batches are indeed balanced? Currently it seems that the batch size is quite small compared to the number of classes, which yields this skewed result.
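
A quick sketch of that debugging step, using stand-in targets (real inputs would be repeated the same way):

import numpy as np
import torch
from torch.utils.data import WeightedRandomSampler

# stand-in for the real 50 targets; the arange guarantees every class appears
targets = np.concatenate([np.arange(23), np.random.randint(0, 23, size=27)])
targets_big = np.tile(targets, 100)  # repeat every sample 100 times

class_sample_count = np.bincount(targets_big, minlength=23)
weight = 1. / class_sample_count
samples_weight = torch.from_numpy(weight[targets_big]).double()
sampler = WeightedRandomSampler(samples_weight, len(samples_weight))
# with batch_size=23 and 23 classes the batches should now be close to balanced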

Thanks, but isn’t there a way to keep the number of majority-class samples the same and increase all the other classes’ sample counts to match the majority class?

E.g., if class 0 has 12000 samples and class 10 has 200, can I keep class 0 at 12000 and bring class 10 up to 12000 as well?

Yes, the dataset size will increase, but is this possible using the WeightedRandomSampler?

This is mainly because I don’t want to lose the data in the majority class.

The sampling is a random process, so you wouldn’t be able to guarantee that all samples from the majority class are used (you could implement a custom sampler to do so), but you can increase the num_samples argument of the sampler:

WeightedRandomSampler(samples_weight, len(samples_weight))

which you are currently setting to len(samples_weight).
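
As a sketch, you could size num_samples so that, in expectation, every class is drawn as often as the majority class appears (reusing class_sample_count and samples_weight from the snippet above):

from torch.utils.data import WeightedRandomSampler

# balanced sampling draws each class with probability 1/num_classes, so
# num_classes * majority_count draws yield ~majority_count samples per class
num_classes = len(class_sample_count)
num_samples = int(num_classes * class_sample_count.max())
sampler = WeightedRandomSampler(samples_weight, num_samples, replacement=True)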

Thanks a lot @ptrblck!!!
You are very informative and I appreciate the fast replies. :slight_smile:

@ptrblck is there a way to use this method in combination with ImageFolder without writing a custom DataLoader?

Yes, you should be able to access dataset.targets and could create the sampler based on these class indices.
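
A minimal sketch, assuming a standard ImageFolder layout (the path and transform are placeholders):

import numpy as np
import torch
from torchvision import datasets, transforms
from torch.utils.data import DataLoader, WeightedRandomSampler

dataset = datasets.ImageFolder('path/to/data',
                               transform=transforms.ToTensor())
targets = np.array(dataset.targets)       # class index for every sample

class_sample_count = np.bincount(targets)  # samples per class
weight = 1. / class_sample_count
samples_weight = torch.from_numpy(weight[targets]).double()

sampler = WeightedRandomSampler(samples_weight, len(samples_weight))
loader = DataLoader(dataset, batch_size=64, sampler=sampler)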

I have done that to grab specific indices, but then I need to be able to pass those indices along with the sample weights to the DataLoader.

My linked example shows how to use the targets to create sample weights and pass them to the WeightedRandomSampler. Does it not work for you? If not, what issues are you seeing, and could you post a code snippet showing them?