DataLoader - using SubsetRandomSampler and WeightedRandomSampler at the same time

Hi @ptrblck, I took the max batch size and checked. The dataset size is 7956, and a batch size of 7956 (i.e. one batch covering the whole dataset) gave the following per-class counts. They are quite close to the mean; the max deviation is still about 10 percent, but I think that is expected due to the randomness.
0 [799, 818, 814, 820, 731, 791, 743, 837, 817, 786]

The class distribution looks reasonable, so I think you are right.

Perhaps much later than when the question was originally posted, but I would suggest simply using zero weights for the validation samples. Since a SubsetRandomSampler can be conceptually thought of as a WeightedRandomSampler with zero weights outside the subset and equal weights within it (i.e. an indicator function of the subset), you can simply do:

import numpy as np
import torch
from torch.utils.data import WeightedRandomSampler

# toy targets: 999 samples of class 0 and a single sample of class 1
target = torch.cat((torch.zeros(999, dtype=torch.long),
                    torch.ones(1, dtype=torch.long)))
np.random.seed(1337)
indices = list(range(1000))
np.random.shuffle(indices)
train_indices, valid_indices = indices[20:], indices[:20]

# inverse-frequency class weights
class_sample_count = torch.tensor(
    [(target == t).sum() for t in torch.unique(target, sorted=True)])
class_weights = 1 / class_sample_count.float()
sample_weights = class_weights[target]
# zero out the validation indices so they can never be drawn
sample_weights[valid_indices] = 0.0

train_sampler = WeightedRandomSampler(sample_weights, len(sample_weights))
assert len(set(valid_indices).intersection(set(train_sampler))) == 0
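
For completeness, a minimal sketch of plugging this sampler into a DataLoader (the TensorDataset here is just a stand-in for a real dataset):

from torch.utils.data import DataLoader, TensorDataset

# stand-in dataset: the features are irrelevant here, only the targets matter
dataset = TensorDataset(torch.randn(1000, 3), target)
loader = DataLoader(dataset, batch_size=64, sampler=train_sampler)

# batches contain training indices only; if the single class-1 sample landed
# in the training split, they will also be roughly balanced between classes
x, y = next(iter(loader))
print(y.sum(), (y == 0).sum())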

I find this solution more natural, so I decided to share it for anyone else finding this thread.

I am training on an imbalanced dataset of 100 classes with counts ranging from 10 to 10000. I am using a weighted sampler to resolve the imbalance, but I am still getting very low accuracy on some classes with high counts. Is it due to overfitting or something else?

Please let me know. Thanks in advance!

Your model might still be overfitting to the majority classes, so you could increase the weights for the minority samples and rerun the training.
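
A minimal sketch of boosting specific minority classes further, reusing the class_weights tensor from the earlier snippet (the boost factor and class indices are made up):

boost = 2.0                    # hypothetical factor, tune per experiment
minority_classes = [1]         # hypothetical minority class indices
for c in minority_classes:
    class_weights[c] *= boost  # these classes will be drawn even more often
# recompute per-sample weights (re-apply any validation masking afterwards)
sample_weights = class_weights[target]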

For example, I have 3 classes (a, b, and c) and I want to balance their weights.

import numpy as np
import torch
from torch.utils.data import WeightedRandomSampler

def sampler_(labels):
    # inverse-frequency weight per class, mapped to a weight per sample
    classes, counts = np.unique(labels, return_counts=True)
    weights = 1.0 / torch.tensor(counts, dtype=torch.float)
    sample_weights = weights[[list(classes).index(x) for x in labels]]
    return WeightedRandomSampler(sample_weights, len(sample_weights),
                                 replacement=True)

train_labels = ['a', 'b', 'a', 'a', 'b', 'c', 'c', 'c']
sampler = sampler_(train_labels)

I think I am doing something wrong. Please help me with the correct code.
Thanks!

The code snippet looks correct, but if your DataLoader is not returning approx. balanced batches, you could refer to this example and compare the implementations.
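
To check the balance directly, a small sketch, assuming a DataLoader (called loader here, hypothetical) built with your sampler and integer targets:

import collections

counter = collections.Counter()
for _, labels in loader:            # hypothetical loader using the sampler
    counter.update(labels.tolist())
print(counter)  # per-class draw counts should be roughly equal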

How much should I increase or decrease?
Please help me with some sample code.

Thanks

If you’ve verified that your weighted sampling works properly (i.e. the batches are balanced) and your model is still overfitting to a specific class (or underfitting to others), you could run several experiments with different weights and check the confusion matrix to select the appropriate sampling for your use case.
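
For the confusion-matrix check, a self-contained sketch (the random predictions are just a stand-in for the outputs of your trained model):

import torch

def confusion_matrix(preds, targets, num_classes):
    # rows are true classes, columns are predicted classes
    cm = torch.zeros(num_classes, num_classes, dtype=torch.long)
    for t, p in zip(targets.tolist(), preds.tolist()):
        cm[t, p] += 1
    return cm

num_classes = 3
targets = torch.randint(0, num_classes, (100,))
preds = torch.randint(0, num_classes, (100,))  # stand-in model outputs
print(confusion_matrix(preds, targets, num_classes))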

Hi @ptrblck ,

My dataset is imbalanced and I want to use WeightedRandomSampler to overcome this.
Should the sampler be added only to the train loader or should it be added to the val and test loaders as well?

Note: Data for all the classes are available in the train set, but only some of those classes appear in the val and test sets.

Thank you!

I would naively use the original distributions in the validation and test splits.
The test set should be used once your training is finished and would give you the model performance on “unseen, real world samples”, while the validation set would be a proxy for the test set during training/validation. If you resample these sets, you would lose the information about your model’s final performance on new, unseen samples (assuming the distribution stays constant).

Hi @ptrblck ,

Thanks for the answer!
I adapted your code to my requirements, but I am not getting the expected results.
Code:

import numpy as np
import torch
from torch.utils.data import DataLoader, WeightedRandomSampler

targets = np.array([data[2] for data in train_set])
print("Train set answers:", targets)
print("Length of train set answers:", len(targets))
print()
print("Answer vocab:")
print(VQADataset.ans_vocab_to_int)
print()
all_ans_values = np.array(list(VQADataset.ans_vocab_to_int.values()))
print("All the answer classes available in the dataset:")
print(all_ans_values)
print()

class_sample_count = np.array(
    [np.count_nonzero(targets == ans) for ans in np.unique(targets)])

print("The count for each class above respectively:")
print(class_sample_count)
print()

weight = 1. / class_sample_count
print("Weight for each class:")
print(weight)
print()

samples_weight = np.array([weight[target] for target in targets])
print("Weight for each target:")
print(samples_weight)

samples_weight = torch.from_numpy(samples_weight)
samples_weight = samples_weight.double()
sampler = WeightedRandomSampler(samples_weight, len(samples_weight))

train_loader = DataLoader(
    train_set, batch_size=23, num_workers=1, sampler=sampler)

for i, (question, image, target) in enumerate(train_loader):
    # per-class counts for this batch, printed as "0/1/.../22: c0/c1/.../c22"
    header = "/".join(str(c) for c in range(23))
    counts = "/".join(str(np.count_nonzero(target.numpy() == c))
                      for c in range(23))
    print("batch index {}, {}: {}".format(i, header, counts))

Please don’t mind the ugly print statement for the dataloader loop :slight_smile:

Output:

Train set answers: [ 9 10 11  0  3  4 12  5  0  5  0  0 13  0 14 15  0  1 16  1  1  1  1  0
 17  1  1  0 18  2  6  1  7  0 19  8 20  0  1 21  0  4  2  1  4  8 22  1
  6  0]
Length of train set answers: 50

Answer vocab:
{'yes': 0, 'no': 1, '1': 2, 'white': 3, 'skiing': 4, 'frisbee': 5, 'blue': 6, 'green': 7, 'gray': 8, 'net': 9, 'pitcher': 10, 'orange': 11, 'red': 12, 'contrail': 13, 'white and purple': 14, 'brushing teeth': 15, 'frowning': 16, 'black and white': 17, 'skateboard': 18, 'motorcycle': 19, '2': 20, 'purse': 21, 'skis': 22}

All the answer classes available in the dataset:
[ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22]

The count for each class above respectively:
[12 11  2  1  3  2  2  1  2  1  1  1  1  1  1  1  1  1  1  1  1  1  1]

Weight for each class:
[0.08333333 0.09090909 0.5        1.         0.33333333 0.5
 0.5        1.         0.5        1.         1.         1.
 1.         1.         1.         1.         1.         1.
 1.         1.         1.         1.         1.        ]

Weight for each target:
[1.         1.         1.         0.08333333 1.         0.33333333
 1.         0.5        0.08333333 0.5        0.08333333 0.08333333
 1.         0.08333333 1.         1.         0.08333333 0.09090909
 1.         0.09090909 0.09090909 0.09090909 0.09090909 0.08333333
 1.         0.09090909 0.09090909 0.08333333 1.         0.5
 0.5        0.09090909 1.         0.08333333 1.         0.5
 1.         0.08333333 0.09090909 1.         0.08333333 0.33333333
 0.5        0.09090909 0.33333333 0.5        1.         0.09090909
 0.5        0.08333333]
batch index 0, 0/1/2/3/4/5/6/7/8/9/10/11/12/13/14/15/16/17/18/19/20/21/22: 0/3/1/1/1/2/2/1/0/1/0/0/3/0/4/0/0/1/1/0/1/1/0
batch index 1, 0/1/2/3/4/5/6/7/8/9/10/11/12/13/14/15/16/17/18/19/20/21/22: 1/2/1/2/1/1/0/0/2/0/1/0/3/1/1/5/0/0/1/0/0/0/1
batch index 2, 0/1/2/3/4/5/6/7/8/9/10/11/12/13/14/15/16/17/18/19/20/21/22: 1/0/0/0/1/0/0/1/0/0/1/0/0/0/0/0/0/0/0/0/0/0/0

The question is: there are initially 12 samples of class 0 in the train set, but after sampling there are only 2 across the entire DataLoader. Is this the expected behaviour?

Yes, the reduction in the number of samples from the majority class and the increase for the minority classes are expected if the same total number of samples is drawn. You would “replace” majority-class samples by oversampling the minority classes.

Just for the sake of debugging: could you artificially increase the number of inputs (just repeat the samples) to see if the batches are indeed balanced? Currently it seems that the batch size is quite small compared to the number of classes, which yields this skewed result.
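
A quick sketch of that debugging step, using stand-in targets (real inputs would be repeated the same way):

import numpy as np
import torch
from torch.utils.data import WeightedRandomSampler

# stand-in for the real 50 targets; the arange guarantees every class appears
targets = np.concatenate([np.arange(23), np.random.randint(0, 23, size=27)])
targets_big = np.tile(targets, 100)  # repeat every sample 100 times

class_sample_count = np.bincount(targets_big, minlength=23)
weight = 1. / class_sample_count
samples_weight = torch.from_numpy(weight[targets_big]).double()
sampler = WeightedRandomSampler(samples_weight, len(samples_weight))
# with batch_size=23 and 23 classes the batches should now be close to balanced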

Thanks, but isn’t there a way to keep the number of majority-class samples the same and increase all the other classes’ sample counts to match the majority class?

E.g., if class 0 has 12000 samples and class 10 has 200, can I keep class 0 at 12000 and bring class 10 up to 12000 as well?

Yes, the dataset size will increase, but is this possible using the WeightedRandomSampler?

This is mainly because I don’t want to lose the data in the majority class.

The sampling is a random process, so you wouldn’t be able to guarantee that all samples from the majority class are used (you could implement a custom sampler to do so), but you can increase the num_samples argument of the sampler:

WeightedRandomSampler(samples_weight, len(samples_weight))

which you are currently setting to len(samples_weight).
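
As a sketch, you could size num_samples so that, in expectation, every class is drawn as often as the majority class appears (reusing class_sample_count and samples_weight from the snippet above):

from torch.utils.data import WeightedRandomSampler

# balanced sampling draws each class with probability 1/num_classes, so
# num_classes * majority_count draws yield ~majority_count samples per class
num_classes = len(class_sample_count)
num_samples = int(num_classes * class_sample_count.max())
sampler = WeightedRandomSampler(samples_weight, num_samples, replacement=True)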

Thanks a lot @ptrblck!!!
You are very informative and I appreciate the fast replies. :slight_smile:

@ptrblck is there a way to use this method in combination with ImageFolder without writing a custom DataLoader?

Yes, you should be able to access dataset.targets and could create the sampler based on these class indices.
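
A minimal sketch, assuming a standard ImageFolder layout (the path and transform are placeholders):

import numpy as np
import torch
from torchvision import datasets, transforms
from torch.utils.data import DataLoader, WeightedRandomSampler

dataset = datasets.ImageFolder('path/to/data',
                               transform=transforms.ToTensor())
targets = np.array(dataset.targets)       # class index for every sample

class_sample_count = np.bincount(targets)  # samples per class
weight = 1. / class_sample_count
samples_weight = torch.from_numpy(weight[targets]).double()

sampler = WeightedRandomSampler(samples_weight, len(samples_weight))
loader = DataLoader(dataset, batch_size=64, sampler=sampler)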

I have done that to grab specific indices, but then I need to be able to pass those indices along with the sample weights to the DataLoader.

My linked example shows how to use the targets to create sample weights and pass them to the WeightedRandomSampler. Does it not work for you? If not, what issues are you seeing, and could you post a code snippet showing them?