Increase number of images

I am trying to replicate a paper in which the dataset size is physically increased, as stated: "The strategy is data augmentation; we augment the ischaemia class using 8 data augmentation techniques, i.e. horizontal and vertical flip, 180-degree rotation, added Gaussian noise, crop, shear, scale, and contrast adjustment with gamma 2.0, yielding a total of 2,043 patches after augmentation."

Currently, I am using transforms.Compose, which augments images on the fly without physically changing the number of images. How can I actually increase the number of images as described in the paper?

Since each sample will be transformed on-the-fly, you could increase the number of epochs, which should have the same effect as directly creating the augmented images and using them in a single epoch.

@ptrblck I want to physically increase the number of images as stated in the paper because the dataset is imbalanced (class 1: 2555, class 2: 227, class 3: 621, class 4: 2552 images). If I increase the number of epochs, the transforms will be applied equally to all classes, and the classification will be biased towards the majority class.

You could use different approaches:

  • you could oversample the minority classes using a WeightedRandomSampler and increase the number of epochs
  • alternatively, you could duplicate indices of the minority classes using a Subset with the desired sample indices
  • or you could directly create copies of the minority class samples (this would use more memory, since you are copying real data)
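A minimal sketch of the second option, assuming a toy stand-in dataset with the class counts from this thread: minority-class indices are repeated until every class matches the majority count, and a plain shuffled DataLoader then sees a balanced set.

```python
import numpy as np
import torch
from torch.utils.data import Subset, TensorDataset, DataLoader

# Toy stand-in for the real dataset: labels follow the thread's class counts.
labels = np.repeat([0, 1, 2, 3], [2555, 227, 621, 2552])
data = torch.arange(len(labels), dtype=torch.float32).unsqueeze(1)
dataset = TensorDataset(data, torch.from_numpy(labels))

# Repeat each class's indices cyclically until it matches the majority count.
counts = np.bincount(labels)
target = counts.max()
indices = np.concatenate([
    np.resize(np.flatnonzero(labels == c), target) for c in range(len(counts))
])

balanced = Subset(dataset, indices.tolist())
loader = DataLoader(balanced, batch_size=64, shuffle=True)
print(len(balanced))  # 4 classes * 2555 = 10220
```

Since Subset only stores indices, the duplicated samples share the underlying data and no image is copied in memory; on-the-fly transforms still produce different augmented versions of each duplicate.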

@ptrblck, I am trying to use WeightedRandomSampler for handling the imbalance in the dataset. However, the intuition behind it is not clear to me. My target labels are in the form of one-hot encoded vectors, as below.

   none  infection  ischaemia  both
0     1          0          0     0
1     1          0          0     0
2     0          1          0     0
3     0          1          0     0
4     0          1          0     0

Below are the steps I used to compute the inputs for the weighted random sampler. Please correct me if I am wrong in the interpretation of any step.

  1. Count the number of samples per class in the dataset
class_sample_count = np.array(train_labels.value_counts()) 
array([2555, 2552,  621,  227])
  2. Calculate the weight associated with each class
weight = 1. / class_sample_count 
array([0.00039139, 0.00039185, 0.00161031, 0.00440529])
  3. Calculate the weight for each of the samples in the dataset.
samples_weight = np.array(weight[train_labels])
print(samples_weight[1], samples_weight[2])
[0.00039185 0.00039139 0.00039139 0.00039139] # label 0 in the actual data
[0.00039139 0.00039185 0.00039139 0.00039139] # label 1 in the actual data

The shape of samples_weight comes out to be [5955, 4]: 5955 is the total number of images in the original set, and 4 corresponds to the number of classes.
Now, how has this mapping been done? The class weight for class 0 is 0.00039139 (obtained in step 2), so how were the remaining three entries for class 0 picked?
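For the record, the extra dimension comes from NumPy fancy indexing: a one-hot row contains only 0s and a single 1, so weight[row] looks up weight[0] and weight[1] rather than the class weight. A minimal sketch, using the weight values from step 2 above:

```python
import numpy as np

weight = np.array([0.00039139, 0.00039185, 0.00161031, 0.00440529])

# A one-hot row is made of 0s and one 1, so weight[row] returns
# weight[0] three times and weight[1] once -- not the class weight.
one_hot_row = np.array([0, 1, 0, 0])   # class 1 ("infection")
print(weight[one_hot_row])             # [w0, w1, w0, w0]

# Converting to a class index first gives a single weight per sample.
class_index = one_hot_row.argmax()     # 1
print(weight[class_index])             # 0.00039185
```

This is exactly why the [5955, 4] matrix appears, and why the conversion to class indices suggested below in the thread collapses it to one weight per sample.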

  4. Convert the np.array to a tensor
samples_weight = torch.from_numpy(samples_weight)
tensor([[0.0004, 0.0004, 0.0004, 0.0004],
        [0.0004, 0.0004, 0.0004, 0.0004],
        [0.0004, 0.0004, 0.0004, 0.0004],
        [0.0004, 0.0004, 0.0004, 0.0004],
        [0.0004, 0.0004, 0.0004, 0.0004],
        [0.0004, 0.0004, 0.0004, 0.0004]], dtype=torch.float64)

After conversion to a tensor, all the samples appear to have the same value in all four entries. Then how does weighted random sampling oversample the minority classes?

I will be grateful for any leads. Thank you.

One-hot encoded targets are not expected and you should transform them to class indices via target = torch.argmax(target, 1). An example of using WeightedRandomSampler can be found here.
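A minimal sketch of that suggestion end to end, using a toy stand-in dataset with the class counts from this thread (the dataset itself and the batch size are placeholders):

```python
import torch
from torch.utils.data import WeightedRandomSampler, DataLoader, TensorDataset

# Toy targets with the thread's class distribution, stored one-hot.
counts = torch.tensor([2555, 227, 621, 2552])
targets_idx = torch.repeat_interleave(torch.arange(4), counts)
one_hot = torch.nn.functional.one_hot(targets_idx, num_classes=4)

# Convert one-hot rows to class indices as suggested.
target = torch.argmax(one_hot, 1)

# One weight per *sample*, taken from the inverse class frequency.
class_weight = 1.0 / counts.float()
samples_weight = class_weight[target]        # shape: [5955], not [5955, 4]

sampler = WeightedRandomSampler(samples_weight,
                                num_samples=len(samples_weight),
                                replacement=True)
dataset = TensorDataset(torch.randn(len(target), 3), target)
loader = DataLoader(dataset, batch_size=64, sampler=sampler)

# Drawn batches should now be roughly balanced across the four classes.
_, batch_targets = next(iter(loader))
print(torch.bincount(batch_targets, minlength=4))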

@ptrblck thank you for your help and patience with me. I have followed your link. I converted the one-hot encoded to class indices.

labels = np.argmax(train_labels.loc[:, 'none':'both'].values, axis=1)
labels = torch.from_numpy(labels)
tensor([0, 0, 1,  ..., 1, 0, 0])
a = torch.unique(samples_weight)
tensor([0.0004, 0.0004, 0.0016, 0.0044], dtype=torch.float64)

Now the tensor has a unique weight for each target label. As for the intuition: does the sampler pick samples with a higher weight more frequently in a batch than samples with a lower weight, so that overall the sampling in a batch is balanced irrespective of the class distribution in the dataset?

I computed the weighted random sampler from the original dataset. After that, I split the dataset into a training and a validation set. Should the same sampler be used for both sets, since its weights are computed from the overall distribution of the original set?

Don’t forget to create the weight for each sample, not only the classes, as shown in the example.

Yes, the higher the weight the more likely this sample will be drawn. Also yes, you can balance the class distribution in a batch, but note that it’s still a random process so you shouldn’t expect to see a perfectly balanced batch in each iteration.

It depends on your use case, and I don’t know if you want to balance the validation dataset. Usually it’s used as a proxy for the final test set (i.e. unknown, “real world” samples). If your final application would also see an imbalanced class distribution, you would have to decide which metric you care about.
My thinking would be to treat the validation dataset similarly to the test set, so that you can indeed use the validation metric as a proxy for the test one, but let’s wait to hear what others think.