WeightedRandomSampler for one hot encoded targets

Hi,
I am working on an imbalanced dataset of around 5000 images with 10 classes. For my project I have to treat each of these classes as an individual binary classification task during the forward pass. I therefore have weights for each task, i.e., a NumPy array of shape (10, 2). How can I implement the weighted sampler here so that the data becomes balanced for each task?

Thanks a lot in advance.


I’m not sure how the samples should be drawn to balance them. The WeightedRandomSampler uses a weight for each sample and applies it to draw the data. Would it be possible to provide a weight for each sample? If not, could you explain how each batch should be created?

I am basically working on a model that changes the binary classification task every batch during training. So I want to balance the samples for that particular task on every forward pass. I have collected the weights over the whole dataset for each task and stored them in a (10, 2) array (10 tasks), for example [[1, 2.43], [1, 3.5], …]. Could you tell me a way to go about this?

About the data:
I get a sample of dimensions (64, 3, 224, 224), where 64 is the batch size, and a labels array of shape (64, 10) from the DataLoader. I extract the corresponding column to get the labels for a particular task.

Based on the description it seems that each batch would have specific requirements, since each forward pass would target a specific task.
If that’s the case, I think the cleanest way would be to create a custom sampler and precalculate the indices based on your task logic (or pass the precalculated indices to a SequentialSampler).
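A minimal sketch of the first suggestion, a custom sampler that simply yields precalculated indices (the indices and task logic here are placeholders you would replace with your own):

```python
import torch
from torch.utils.data import Sampler


class PrecomputedSampler(Sampler):
    """Yields indices that were precalculated based on some task logic."""

    def __init__(self, indices):
        self.indices = indices

    def __iter__(self):
        return iter(self.indices)

    def __len__(self):
        return len(self.indices)


# hypothetical precalculated indices for one task
indices = [3, 1, 4, 1, 5]
sampler = PrecomputedSampler(indices)
# pass it to the DataLoader via DataLoader(dataset, sampler=sampler, ...)
```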

Got it. The aim is to generate augmented images to reduce the imbalance present in the dataset for the respective tasks. Probably I could create dataloaders for each task. If that is possible, is there a way to cleanly call the dataloaders in the batch loop during training?

Yes, you could create the iterators via:

iter1 = iter(loader1)
iter2 = iter(loader2)
...

and get the batches via: batch = next(iter1) etc.
Note that you would have to catch the exception once the iterator is done and would have to recreate the iterators afterwards.

I see! Thanks a lot for the help!

I have one more doubt. Is there a way to augment only the extra images obtained from oversampling with the WeightedRandomSampler?

If you are using two different DataLoaders (one with the standard sampler, the other with the WeightedRandomSampler), you could also pass different transformations to the datasets and thus make sure the augmentations are only applied to one of them.
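A sketch of that two-loader setup, using a simple wrapper dataset and a stand-in "augmentation" (the data, weights, and transform are all made up for illustration):

```python
import torch
from torch.utils.data import Dataset, DataLoader, WeightedRandomSampler


class TransformDataset(Dataset):
    """Wraps raw data and applies an optional transform per sample."""

    def __init__(self, data, transform=None):
        self.data = data
        self.transform = transform

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        x = self.data[idx]
        if self.transform is not None:
            x = self.transform(x)
        return x


data = torch.arange(4).float()
plain_ds = TransformDataset(data)                              # no augmentation
aug_ds = TransformDataset(data, transform=lambda x: x + 100)   # stand-in augmentation

weights = torch.tensor([0.1, 0.1, 0.4, 0.4])  # hypothetical per-sample weights
sampler = WeightedRandomSampler(weights, num_samples=len(data), replacement=True)

plain_loader = DataLoader(plain_ds, batch_size=2)               # standard sampling
aug_loader = DataLoader(aug_ds, batch_size=2, sampler=sampler)  # weighted sampling
```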

Yes, but wouldn’t the one with the standard sampler be imbalanced again? Is there a way to extract just the oversampled images?

Sorry, I misunderstood the question. If I understand it now correctly, you would like to only augment data samples, which were re-drawn while the first occurrence of the sample should not be transformed?

If that’s the case, I think the cleanest approach would be to write a custom sampler, reuse the WeightedRandomSampler logic, and add another flag to indicate whether the current sample index should be transformed or not.
You could theoretically also track the indices in the Dataset, but it would break if you are using multiple workers (which is the common case).

Yes that’s what I meant.

I see! Regarding the tracking of indices, could you tell me how to figure out if the index points to the re-drawn image, or are they generally appended at the end?

The indices are drawn from torch.multinomial and are thus not appended at the end as seen here. You could check for duplicates in these indices and create the flags accordingly.
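A small sketch of that duplicate check: flag the second and later occurrences of each drawn index as "re-drawn", so only those samples would be augmented (the weights here are random placeholders):

```python
import torch

torch.manual_seed(0)
weights = torch.rand(100)  # hypothetical per-sample weights
indices = torch.multinomial(weights, 100, replacement=True)

# flag the second and later occurrences of each index: these are the
# re-drawn samples that should be augmented
seen = set()
augment_flags = []
for idx in indices.tolist():
    augment_flags.append(idx in seen)
    seen.add(idx)
```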

Got it. Thank you so much! I think that extracting the re-drawn images would be a useful feature in this sampler in the future.

import numpy as np
import torch

weight = torch.Tensor(np.array([0.0018, 0.01]))
rand_tensor = torch.multinomial(weight, 64, replacement=True)
print(rand_tensor)

Output : tensor([1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])
These are the indices returned by the multinomial function.

I have a doubt regarding the working of the sampler. May I know how this can be used to find duplicate images? Sorry if the question is a basic one, I am new to writing code for samplers in PyTorch.

The weight tensor should contain a weight value for each sample.
In your case you are using only 2 samples and drawing them 64 times, which explains the result.
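A common way to build such a per-sample weight tensor from the class labels of one binary task (the labels below are made up): compute the inverse class frequency and index it with the labels.

```python
import torch
from torch.utils.data import WeightedRandomSampler

# hypothetical labels for 10 samples of one binary task
labels = torch.tensor([0, 0, 0, 0, 0, 0, 0, 1, 1, 1])

class_counts = torch.bincount(labels)        # counts per class: tensor([7, 3])
class_weights = 1.0 / class_counts.float()   # inverse frequency per class
sample_weights = class_weights[labels]       # one weight per sample

# minority-class samples now get a higher probability of being drawn
sampler = WeightedRandomSampler(sample_weights,
                                num_samples=len(labels),
                                replacement=True)
```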