WeightedRandomSampler does not sample uniformly

Hi there!

I am facing a problem with WeightedRandomSampler: I want to sample data uniformly across several labels.

A really short description of the problem:
The problem concerns plankton image classification. The data is tremendously unbalanced, as the majority of images are just “detritus” (which is not even plankton).
In numbers, the most common label is “detritus” with ~100k images in the dataset, whereas the least common has just 12 samples. I want a high macro F1 score, so I want to train my models on a uniform distribution of labels.

Here I build the data loaders:

# weights for weighted random sampler

X_ = plankton_df['objid']
y_ = plankton_df['level2']

X_train, X_test, y_train, y_test = train_test_split(X_, y_, test_size=0.15, stratify=y_)

weights_df = plankton_df[plankton_df['objid'].isin(X_train.values)] # assign weights only to samples in the training set
weights_df['level2'] = weights_df['level2'].map(label_mapping)
weights_df['count'] = weights_df.groupby('level2')['level2'].transform('count')
weights_df['count'] = 1. / (weights_df['count'])
weights = weights_df['count'].values
weights = torch.DoubleTensor(weights)
detritus_wrs = torch.utils.data.sampler.WeightedRandomSampler(weights, len(weights), replacement=True)

# Extract images in memory

plank_train_dataset = PlanktonDataset(X_train.values, y_train.values,
                                        img_files = img_files,
                                        label_mapping = label_mapping,
                                        transform = training_transformations,
                                        )
plank_train_dataloader = DataLoader(plank_train_dataset, batch_size=64, sampler=detritus_wrs, num_workers=4)

With the line of code weights_df['count'] = 1. / weights_df['count'] I just want to make sure that each sample is assigned the inverse of the cardinality of its label.
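
As a quick sanity check of the weight table itself, one can verify that the total weight assigned to each class is roughly the same, since the summed weight of a class is proportional to the probability of drawing that class. A minimal sketch, using the weights_df built above:

per_class_weight = weights_df.groupby('level2')['count'].sum()
# each class should receive about 1/num_classes of the total weight
print(per_class_weight / per_class_weight.sum())

Note that this only checks the weight table; it does not verify that the weights line up with the order of the samples fed to the sampler.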

But if I run this cell

minibatch = next(iter(plank_train_dataloader))
print(minibatch['label'])

I get the following output (the labels are not uniform at all):

tensor([ 1,  0,  2,  4,  0,  0,  0,  2,  0, 15,  1,  3,  0,  0, 20, 11,  2,  0,
         0,  0,  0,  0,  1,  0,  0,  0,  2,  3,  1,  1,  2,  0, 12,  0,  0,  0,
         0, 15, 26,  0, 21,  0,  0, 27,  9,  2,  0,  3,  0,  0,  0,  2,  0,  0,
         0, 16,  0,  3,  0,  0,  1,  4,  0,  0])

Do you have any idea? It is more likely that there is a mistake in my code than a bug in the library :joy:

I tried to be as concise as possible, but if you need other pieces of code or more information about the problem, just write here.

Thanks in advance for your consideration

I’m not sure what kind of pandas magic you are using, but could you print some values of your weights tensor? :slight_smile:
I assume you are assigning the inverse class count to each sample, is that correct?

Thanks for the answer!

Your intuition is right, that’s exactly what I do.

print(weights_df[:10])

output:

       objid  level2     count
0   32756761       0  0.000007
2   32758055      13  0.000463
3   32758988       5  0.000178
5   32760828       0  0.000007
6   32760820       0  0.000007
10  32760656       5  0.000178
11  32760819       1  0.000037
12  32760823       0  0.000007
13  32636402      17  0.001184
15  32761023       5  0.000178

N.B. Labels (level2 is the label column) are mapped to integer numbers, sorted by frequency in decreasing order. Label 0 is detritus, the label with ~100k samples, and label 5 is silks with 5629 samples.

If I print the values instead (which are the ones I pass to the WeightedRandomSampler), this is what I get:

print(weights_df['count'].values[:10])
[7.22e-06 4.63e-04 1.78e-04 7.22e-06 7.22e-06 1.78e-04 3.71e-05 7.22e-06
 1.18e-03 1.78e-04]

Here I print the tensor:

print(weights[:10])
tensor([7.2234e-06, 4.6296e-04, 1.7764e-04, 7.2234e-06, 7.2234e-06, 1.7764e-04,
        3.7124e-05, 7.2234e-06, 1.1838e-03, 1.7764e-04], dtype=torch.float64)

Your code looks alright.
Could you post the class counts or compare your code with this small example and check for differences?
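
Something along these lines (a minimal sketch; the dummy class counts of 990/10 and the batch size of 100 are just illustrative, chosen to match the output shown below):

import torch
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

# dummy imbalanced data: 990 samples of class 0 and 10 samples of class 1
targets = torch.cat([torch.zeros(990, dtype=torch.long), torch.ones(10, dtype=torch.long)])
data = torch.randn(len(targets), 1)
print('target train 0/1: {}/{}'.format((targets == 0).sum().item(), (targets == 1).sum().item()))

# per-sample weight = inverse count of the sample's class
class_counts = torch.bincount(targets).double()
sample_weights = 1. / class_counts[targets]

sampler = WeightedRandomSampler(sample_weights, num_samples=len(sample_weights), replacement=True)
loader = DataLoader(TensorDataset(data, targets), batch_size=100, sampler=sampler)

for batch_idx, (x, y) in enumerate(loader):
    print('batch index {}, 0/1: {}/{}'.format(batch_idx, (y == 0).sum().item(), (y == 1).sum().item()))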

I have run your code and it actually works on my cluster. It balances the two classes perfectly.
Output:

target train 0/1: 990/10
batch index 0, 0/1: 48/52
batch index 1, 0/1: 49/51
batch index 2, 0/1: 48/52
batch index 3, 0/1: 54/46
batch index 4, 0/1: 52/48
batch index 5, 0/1: 51/49
batch index 6, 0/1: 47/53
batch index 7, 0/1: 53/47
batch index 8, 0/1: 48/52
batch index 9, 0/1: 55/45

I don’t see any particular difference between the two pieces of code. The only curious difference is the number of workers, but even setting it to one I don’t get the correct distribution.

That’s strange. Could you post the class counts, so that I could just create dummy data using your numbers?

Thanks to your example, I finally found the error. As usual, it is quite a simple one.

X_train and weights_df do not share the same order of samples, so the weights were correctly set for weights_df but not for X_train.

X_train is obtained from train_test_split with the option stratify=y_, which makes sure that the label distributions in the training and validation sets are as close as possible. Because of this, the original order of the dataset is not maintained.
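
A tiny illustration of the mismatch (a sketch with made-up objids; the shuffled values will differ from run to run): boolean indexing with isin() keeps the rows in the original dataframe order, while train_test_split with stratify returns a shuffled X_train.

import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({'objid': [100, 101, 102, 103, 104, 105],
                   'level2': [0, 0, 0, 1, 1, 1]})
X_tr, X_te, y_tr, y_te = train_test_split(df['objid'], df['level2'],
                                          test_size=2, stratify=df['level2'])
print(X_tr.values)                                        # shuffled, e.g. [104 100 103 101]
print(df[df['objid'].isin(X_tr.values)]['objid'].values)  # original order, e.g. [100 101 103 104]

So the weights end up ordered like the original dataframe while the samples are ordered like X_train.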

In practice, I just substituted the line

weights_df = plankton_df[plankton_df['objid'].isin(X_train.values)] 

with

weights_df = pd.DataFrame({'objid':X_train.values, 'level2': y_train.values})
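
After rebuilding weights_df this way and re-running the weight computation, sampler, and dataloader, a quick check (a sketch, assuming the same dataloader and dict-style samples as above) shows that labels are now drawn roughly uniformly; a single batch of 64 is still noisy, but no class dominates anymore:

minibatch = next(iter(plank_train_dataloader))
print(torch.bincount(minibatch['label']))  # per-label counts, should be roughly even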

Thanks very much, your questions led me to the solution :slight_smile:

Awesome you’ve figured it out! Sounds like a nasty bug :wink:
