Hi @ptrblck, I checked with the maximum batch size. The dataset size is 7956, and a batch size of 9172 gave the following results. The counts are quite close to the mean; the max deviation is still about 10 percent, but I think that is expected due to the randomness.
0 [799, 818, 814, 820, 731, 791, 743, 837, 817, 786]
The class distribution looks reasonable, so I think you are right.
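As a rough sanity check on "expected due to the randomness", the spread of the counts can be compared with a binomial model. This is only a sketch: it assumes every class is drawn with probability 1/10 and takes the total number of draws to be the sum of the reported counts.

```python
import math

# Per-class counts reported in the post above.
counts = [799, 818, 814, 820, 731, 791, 743, 837, 817, 786]

n_draws = sum(counts)        # total number of sampled items
n_classes = len(counts)
p = 1.0 / n_classes          # assumption: perfectly balanced sampling

expected = n_draws * p
std = math.sqrt(n_draws * p * (1.0 - p))   # binomial standard deviation per class

max_abs_dev = max(abs(c - expected) for c in counts)
print(f"expected per class: {expected:.1f}, std: {std:.1f}")
print(f"largest deviation: {max_abs_dev:.1f} "
      f"({max_abs_dev / expected:.1%} of the mean, {max_abs_dev / std:.1f} std)")
```

Under these assumptions the largest deviation is a bit over two standard deviations, which is not unusual when ten classes are inspected at once.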
Perhaps much later than the time the question was originally posted, but I could suggest just using zero weights for the validation dataset. As the SubsetRandomSampler can be conceptually thought of as a WeightedRandomSampler with zero weights outside of the subset and equal weights within the subset (i.e. a characterization of the set), you can simply do:
import numpy as np
import torch
from torch.utils.data import WeightedRandomSampler

# 999 samples of class 0 and a single sample of class 1
target = torch.cat((torch.zeros(999, dtype=torch.long),
                    torch.ones(1, dtype=torch.long)))
np.random.seed(1337)
indices = list(range(1000))
np.random.shuffle(indices)
train_indices, valid_indices = indices[20:], indices[:20]
class_sample_count = torch.tensor(
    [(target == t).sum() for t in torch.unique(target, sorted=True)])
class_weights = 1 / class_sample_count.float()
sample_weights = torch.tensor([class_weights[t] for t in target])
# cannot draw from the validation data
sample_weights[valid_indices] = 0.0
train_sampler = WeightedRandomSampler(sample_weights, len(sample_weights))
assert len(set(valid_indices).intersection(set(train_sampler))) == 0
I find this solution more natural so decided to share it for everyone else finding this thread.
I am training on an imbalanced dataset of 100 classes with per-class counts ranging from 10 to 10000. I am using a weighted sampler to address the imbalance, but I am still getting very low accuracy on some classes with high counts. Is it due to overfitting or something else?
Please let me know. Thanks in advance.
Your model might still be overfitting to the majority classes, so you could increase the weights for the minority samples and rerun the training.
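One possible way to increase the minority weights beyond plain inverse frequency is to raise the counts to a power. The exponent `alpha` below is a hypothetical knob for illustration, not a PyTorch parameter, and the class counts are made up.

```python
import torch

def class_weights_with_exponent(class_counts, alpha=1.0):
    """Return 1 / count**alpha; alpha > 1 favors minority classes even more."""
    counts = torch.as_tensor(class_counts, dtype=torch.float)
    return 1.0 / counts.pow(alpha)

class_counts = [10, 100, 10000]   # made-up counts for three classes
counts_f = torch.tensor(class_counts, dtype=torch.float)

for alpha in (0.5, 1.0, 1.5):
    weights = class_weights_with_exponent(class_counts, alpha)
    # Share of draws each class would receive = count * per-sample weight, normalized.
    mass = counts_f * weights
    share = mass / mass.sum()
    print(alpha, [round(s, 3) for s in share.tolist()])
```

With `alpha=1.0` every class gets the same share of draws; larger values shift the share toward the rare classes, smaller values back toward the frequent ones.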
For example, I have 3 classes (a,b, and c) and I want to balance their weights.
import numpy as np
import torch

def sampler_(labels):
    classes, counts = np.unique(labels, return_counts=True)
    weights = 1.0 / torch.tensor(counts, dtype=torch.float)
    # map every label to the weight of its class
    sample_weights = weights[[list(classes).index(x) for x in labels]]
    sampler = torch.utils.data.sampler.WeightedRandomSampler(sample_weights, len(sample_weights), replacement=True)
    return sampler

train_labels = ['a','b','a','a','b','c','c','c']
sampler = sampler_(train_labels)
sampler
I think I am doing something wrong. Please help me with the correct code.
Thanks
The code snippet looks correct, but if your DataLoader is not returning approx. balanced batches, you could refer to this example and compare the implementations.
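A quick way to check whether a DataLoader returns roughly balanced batches is to count the labels per batch. The sketch below uses made-up data in a TensorDataset; the class sizes and batch size are assumptions for illustration only.

```python
from collections import Counter

import torch
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

torch.manual_seed(0)

# Made-up imbalanced labels: 900 of class 0, 90 of class 1, 10 of class 2.
labels = torch.cat([torch.zeros(900), torch.ones(90), torch.full((10,), 2.0)]).long()
data = torch.randn(len(labels), 4)

# Inverse-frequency weight for every sample.
class_counts = torch.bincount(labels)
sample_weights = (1.0 / class_counts.float())[labels]

sampler = WeightedRandomSampler(sample_weights, num_samples=len(labels), replacement=True)
loader = DataLoader(TensorDataset(data, labels), batch_size=100, sampler=sampler)

seen = Counter()
for _, y in loader:
    # Per-batch class counts; these should be roughly equal across classes.
    print(torch.bincount(y, minlength=3).tolist())
    seen.update(y.tolist())

print(dict(seen))
```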
How much should I increase or decrease?
Please help me with some sample code.
Thanks
If you've verified that your weighted sampling works properly (i.e. the batches are balanced) and your model is still overfitting to a specific class (or underfitting to others), you could run several experiments with different weights and check the confusion matrix to select the appropriate sampling for your use case.
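For comparing such experiments, a confusion matrix and per-class recall can be computed with a few tensor operations. This is a minimal sketch; `confusion_matrix` is a helper written for illustration, and the targets and predictions are toy values rather than results from any model.

```python
import torch

def confusion_matrix(targets, preds, num_classes):
    """Rows are true classes, columns are predicted classes."""
    targets = torch.as_tensor(targets, dtype=torch.long)
    preds = torch.as_tensor(preds, dtype=torch.long)
    # Encode each (target, prediction) pair as a single index and count them.
    flat = targets * num_classes + preds
    return torch.bincount(flat, minlength=num_classes ** 2).reshape(num_classes, num_classes)

# Toy values for illustration only.
targets = [0, 0, 0, 1, 1, 2]
preds = [0, 0, 1, 1, 1, 0]

cm = confusion_matrix(targets, preds, num_classes=3)
# Recall per class = correct predictions / number of true samples of that class.
per_class_recall = cm.diag().float() / cm.sum(dim=1).clamp(min=1).float()
print(cm)
print(per_class_recall)
```

Running this once per weighting scheme on the validation predictions makes it easier to see which classes gain or lose from a given set of weights.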
Hi @ptrblck ,
My dataset is imbalanced and I want to use WeightedRandomSampler to overcome this.
Should the sampler be added only to the train loader or should it be added to the val and test loaders as well?
Note: Data for all classes is available in the train set, but the val and test sets contain data for only some of these classes.
Thank you!
I would naively use the original distributions in the validation and test splits.
The test set should be used once your training is finished and would give you the model performance on "unseen, real world samples", while the validation set would be a proxy for the test set during the training/validation. If you resample these sets, you would lose the information about your model's final performance on new, unseen samples (assuming the distribution stays constant).
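In code, this could look roughly like the following: the sampler is passed only to the training loader, while the validation loader iterates over the data as-is. The datasets here are made-up TensorDatasets and `make_split` is a hypothetical helper for the example.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

torch.manual_seed(0)

def make_split(n_major, n_minor):
    """Build a toy two-class dataset with the given class sizes."""
    y = torch.cat([torch.zeros(n_major), torch.ones(n_minor)]).long()
    x = torch.randn(len(y), 4)
    return TensorDataset(x, y)

train_set, val_set = make_split(900, 100), make_split(90, 10)

# Weighted sampling for training only.
train_targets = train_set.tensors[1]
class_counts = torch.bincount(train_targets)
sample_weights = (1.0 / class_counts.float())[train_targets]
train_sampler = WeightedRandomSampler(sample_weights, num_samples=len(sample_weights))
train_loader = DataLoader(train_set, batch_size=50, sampler=train_sampler)

# Validation keeps the original class distribution: no sampler, no shuffling.
val_loader = DataLoader(val_set, batch_size=50, shuffle=False)

train_seen = torch.cat([y for _, y in train_loader])
val_seen = torch.cat([y for _, y in val_loader])
print("train:", torch.bincount(train_seen, minlength=2).tolist())
print("val:  ", torch.bincount(val_seen, minlength=2).tolist())
```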
Hi @ptrblck ,
Thanks for the answer!
I adapted your code to my requirements, but I am not getting the expected results.
Code:
targets = np.array([data[2] for data in train_set])
print("Train set answers:", targets)
print("Length of train set answers:", len(targets))
print()
print("Answer vocab:")
print(VQADataset.ans_vocab_to_int)
print()
all_ans_values = np.array(list(VQADataset.ans_vocab_to_int.values()))
print("All the answer classes available in the dataset:")
print(all_ans_values)
print()
class_sample_count = np.array(
    [np.count_nonzero(targets == ans) for ans in np.unique(targets)])
print("The count for each class above respectively:")
print(class_sample_count)
print()
weight = 1. / class_sample_count
print("Weight for each class:")
print(weight)
print()
samples_weight = np.array([weight[target] for target in targets])
print("Weight for each target:")
print(samples_weight)
samples_weight = torch.from_numpy(samples_weight)
samples_weight = samples_weight.double()
sampler = WeightedRandomSampler(samples_weight, len(samples_weight))
train_loader = DataLoader(
    train_set, batch_size=23, num_workers=1, sampler=sampler)
for i, (question, image, target) in enumerate(train_loader):
    print("batch index {}, 0/1/2/3/4/5/6/7/8/9/10/11/12/13/14/15/16/17/18/19/20/21/22: {}/{}/{}/{}/{}/{}/{}/{}/{}/{}/{}/{}/{}/{}/{}/{}/{}/{}/{}/{}/{}/{}/{}".format(
        i,
        np.count_nonzero(target.numpy() == 0),
        np.count_nonzero(target.numpy() == 1),
        np.count_nonzero(target.numpy() == 2),
        np.count_nonzero(target.numpy() == 3),
        np.count_nonzero(target.numpy() == 4),
        np.count_nonzero(target.numpy() == 5),
        np.count_nonzero(target.numpy() == 6),
        np.count_nonzero(target.numpy() == 7),
        np.count_nonzero(target.numpy() == 8),
        np.count_nonzero(target.numpy() == 9),
        np.count_nonzero(target.numpy() == 10),
        np.count_nonzero(target.numpy() == 11),
        np.count_nonzero(target.numpy() == 12),
        np.count_nonzero(target.numpy() == 13),
        np.count_nonzero(target.numpy() == 14),
        np.count_nonzero(target.numpy() == 15),
        np.count_nonzero(target.numpy() == 16),
        np.count_nonzero(target.numpy() == 17),
        np.count_nonzero(target.numpy() == 18),
        np.count_nonzero(target.numpy() == 19),
        np.count_nonzero(target.numpy() == 20),
        np.count_nonzero(target.numpy() == 21),
        np.count_nonzero(target.numpy() == 22)))
Please don't mind the ugly print statement for the dataloader loop.
Output:
Train set answers: [ 9 10 11 0 3 4 12 5 0 5 0 0 13 0 14 15 0 1 16 1 1 1 1 0
17 1 1 0 18 2 6 1 7 0 19 8 20 0 1 21 0 4 2 1 4 8 22 1
6 0]
Length of train set answers: 50
Answer vocab:
{'yes': 0, 'no': 1, '1': 2, 'white': 3, 'skiing': 4, 'frisbee': 5, 'blue': 6, 'green': 7, 'gray': 8, 'net': 9, 'pitcher': 10, 'orange': 11, 'red': 12, 'contrail': 13, 'white and purple': 14, 'brushing teeth': 15, 'frowning': 16, 'black and white': 17, 'skateboard': 18, 'motorcycle': 19, '2': 20, 'purse': 21, 'skis': 22}
All the answer classes available in the dataset:
[ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22]
The count for each class above respectively:
[12 11 2 1 3 2 2 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1]
Weight for each class:
[0.08333333 0.09090909 0.5 1. 0.33333333 0.5
0.5 1. 0.5 1. 1. 1.
1. 1. 1. 1. 1. 1.
1. 1. 1. 1. 1. ]
Weight for each target:
[1. 1. 1. 0.08333333 1. 0.33333333
1. 0.5 0.08333333 0.5 0.08333333 0.08333333
1. 0.08333333 1. 1. 0.08333333 0.09090909
1. 0.09090909 0.09090909 0.09090909 0.09090909 0.08333333
1. 0.09090909 0.09090909 0.08333333 1. 0.5
0.5 0.09090909 1. 0.08333333 1. 0.5
1. 0.08333333 0.09090909 1. 0.08333333 0.33333333
0.5 0.09090909 0.33333333 0.5 1. 0.09090909
0.5 0.08333333]
batch index 0, 0/1/2/3/4/5/6/7/8/9/10/11/12/13/14/15/16/17/18/19/20/21/22: 0/3/1/1/1/2/2/1/0/1/0/0/3/0/4/0/0/1/1/0/1/1/0
batch index 1, 0/1/2/3/4/5/6/7/8/9/10/11/12/13/14/15/16/17/18/19/20/21/22: 1/2/1/2/1/1/0/0/2/0/1/0/3/1/1/5/0/0/1/0/0/0/1
batch index 2, 0/1/2/3/4/5/6/7/8/9/10/11/12/13/14/15/16/17/18/19/20/21/22: 1/0/0/0/1/0/0/1/0/0/1/0/0/0/0/0/0/0/0/0/0/0/0
The question is: there are initially 12 targets of class 0 in the train set, but after sampling there are only 2 across the entire dataloader. Is this the expected behaviour?
Yes, the reduction of the number of samples from the majority class and an increase for the minority classes are expected, if the same number of samples is used. You would "replace" the majority classes by oversampling the minority classes.
Just for the sake of debugging: could you artificially increase the number of inputs (just repeat the samples) to see if the batches are indeed balanced? Currently it seems that the batch size is quite small compared to the number of classes, which yields this skewed result.
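The debugging step suggested here could be sketched as follows. The targets are copied from the "Train set answers" output above; the repeat factor of 200 is an arbitrary assumption, and only the indices are sampled, so no actual dataset is needed.

```python
import numpy as np
import torch
from torch.utils.data import WeightedRandomSampler

torch.manual_seed(0)

# Targets copied from the "Train set answers" output in the post (50 samples).
targets = np.array([9, 10, 11, 0, 3, 4, 12, 5, 0, 5, 0, 0, 13, 0, 14, 15, 0, 1, 16, 1, 1, 1, 1, 0,
                    17, 1, 1, 0, 18, 2, 6, 1, 7, 0, 19, 8, 20, 0, 1, 21, 0, 4, 2, 1, 4, 8, 22, 1,
                    6, 0])

repeats = 200                            # arbitrary factor, for debugging only
big_targets = np.tile(targets, repeats)  # 50 * 200 = 10000 targets

class_counts = np.bincount(big_targets)
samples_weight = torch.from_numpy(1.0 / class_counts[big_targets]).double()
sampler = WeightedRandomSampler(samples_weight, len(samples_weight))

# Look up the class of every drawn index and count the draws per class.
drawn = big_targets[list(sampler)]
drawn_counts = np.bincount(drawn, minlength=len(class_counts))
print(drawn_counts)
```

With this many draws, each of the 23 classes should end up with a similar count, which is hard to see with only 50 samples and a batch size of 23.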
Thanks, but isn't there a way to keep the number of majority class samples the same and increase the samples of all other classes to the number of samples in the majority class?
E.g. if class 0 has 12000 samples and class 10 has 200 samples, can I bring both class 0 and class 10 to 12000 samples?
Yes, the dataset size will increase, but is this possible using the WeightedRandomSampler?
This is mainly because I don't want to lose the data in the majority class.
The sampling is a random process, so you wouldn't be able to guarantee that all samples from the majority class were used (you could implement a custom sampler to do so), but you can increase the length in the sampler:
WeightedRandomSampler(samples_weight, len(samples_weight))
as you are currently setting it to len(samples_weight).
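As a sketch of what increasing the length could look like, the second argument (`num_samples`) can be set to the largest class count times the number of classes. The class sizes below are a made-up, smaller version of the 12000 vs. 200 example, and the choice of `num_samples` is one possible heuristic, not a recommendation from the thread.

```python
import torch
from torch.utils.data import WeightedRandomSampler

torch.manual_seed(0)

# Made-up, scaled-down version of the example: 1200 vs. 20 samples.
targets = torch.cat([torch.zeros(1200), torch.ones(20)]).long()
class_counts = torch.bincount(targets)
samples_weight = (1.0 / class_counts.float())[targets].double()

# Aim for roughly "largest class count" draws per class in one epoch.
num_samples = int(class_counts.max()) * len(class_counts)
sampler = WeightedRandomSampler(samples_weight, num_samples=num_samples, replacement=True)

indices = torch.tensor(list(sampler))
drawn_counts = torch.bincount(targets[indices], minlength=len(class_counts))
# Number of distinct majority-class samples that were actually drawn.
unique_majority = torch.unique(indices[targets[indices] == 0]).numel()
print(drawn_counts.tolist(), unique_majority)
```

Because the draws are random and made with replacement, the number of distinct majority samples seen in one epoch will typically stay below the class size, which matches the caveat above.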
@ptrblck is there a way to use this method in combination with ImageFolder without writing a custom dataloader?
Yes, you should be able to access dataset.targets and could create the sampler based on these class indices.
I have done that to grab specific indices, but then I need to be able to pass those indices along with the sample weights to the dataloader.
My linked example shows how to use the targets to create sample weights and pass them to the WeightedRandomSampler. Does it not work for you? If not, what issues are you seeing and could you post a code snippet showing these?
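For reference, building the sampler from a list of class indices could look roughly like this. `make_weighted_sampler` is a hypothetical helper, the ImageFolder usage is shown only in comments with a placeholder path, and stand-in targets are used so the sketch runs without image files.

```python
import torch
from torch.utils.data import WeightedRandomSampler

def make_weighted_sampler(targets):
    """Create an inverse-frequency sampler from a sequence of class indices."""
    targets = torch.as_tensor(targets, dtype=torch.long)
    class_counts = torch.bincount(targets)
    sample_weights = (1.0 / class_counts.float())[targets]
    return WeightedRandomSampler(sample_weights, num_samples=len(sample_weights), replacement=True)

# With torchvision this would look roughly like (path is a placeholder):
#   dataset = torchvision.datasets.ImageFolder("path/to/train")
#   sampler = make_weighted_sampler(dataset.targets)
#   loader = torch.utils.data.DataLoader(dataset, batch_size=32, sampler=sampler)

# Stand-in targets: 90 samples of class 0 and 10 samples of class 1.
fake_targets = [0] * 90 + [1] * 10
sampler = make_weighted_sampler(fake_targets)
print(len(sampler), list(sampler)[:10])
```

The sampler only yields indices into the dataset, so the DataLoader needs nothing beyond `sampler=sampler`; the weights themselves are not passed to it separately.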