[K-means] - Sampling in Dataloader

nrupatunga · May 27, 2020, 11:49am

I have a classification dataset like the one below:

class-1: 2500 images
class-2: 2500 images
class-3: 2500 images
...
...
class-10: 2500 images

Each class has internal variation that could be captured by clustering within that class. For example, if I run k-means on each class separately with 10 clusters:

class-1: 2500 images split into 10 clusters
class-2: 2500 images split into 10 clusters
..
..

Now, in the dataloader, for a batch size of 100 I would like:

- 10 examples from class-1 (one from each of the ten clusters in class-1)
- 10 examples from class-2 (one from each of the ten clusters in class-2)
...
- 10 examples from class-10 (one from each of the ten clusters in class-10)

In total, the batch will contain 100 samples that cover every cluster of every class.

Is there a way to do this, and how can I do it efficiently?

ptrblck · May 28, 2020, 5:36am

This use case might be possible to implement using a custom sampler.
You would need to create the mapping between each sample and the class as well as the current cluster.
Using this mapping, you could then implement a custom sampler by deriving from the base class, torch.utils.data.Sampler.
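
As a rough sketch of the mapping step, assuming you already have one class label and one k-means cluster id per sample (the function name build_group_index and the toy lists are hypothetical, just for illustration):

```python
from collections import defaultdict


def build_group_index(labels, cluster_ids):
    """Map each (class, cluster) pair to the dataset indices that belong to it."""
    if len(labels) != len(cluster_ids):
        raise ValueError("labels and cluster_ids must have the same length")
    groups = defaultdict(list)
    for idx, (label, cluster) in enumerate(zip(labels, cluster_ids)):
        groups[(label, cluster)].append(idx)
    return dict(groups)


# Toy example: 6 samples, 2 classes, 2 clusters per class (made-up values).
labels = [0, 0, 0, 1, 1, 1]
cluster_ids = [0, 1, 0, 1, 1, 0]
groups = build_group_index(labels, cluster_ids)
```

Computing this once up front keeps the per-batch work down to a lookup per group.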

The sampler will return the indices for the current sample or batch.
It might be easier to derive from BatchSampler and pass a batch of indices into the Dataset instead of single indices.
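
A minimal sketch of such a batch-level sampler, under a few assumptions: the class name ClusterBalancedBatchSampler and its parameters are made up for this example, it is written as a plain iterable rather than a subclass, and it draws one random index per (class, cluster) group, so samples from small clusters are reused more often. An object like this could be passed to DataLoader through its batch_sampler argument, which accepts an iterable that yields lists of indices.

```python
import random
from collections import defaultdict


class ClusterBalancedBatchSampler:
    """Yields batches holding one index from every (class, cluster) group."""

    def __init__(self, labels, cluster_ids, num_batches=None, seed=None):
        if len(labels) != len(cluster_ids):
            raise ValueError("labels and cluster_ids must have the same length")
        groups = defaultdict(list)
        for idx, key in enumerate(zip(labels, cluster_ids)):
            groups[key].append(idx)
        # Fixed group order so every batch has the same layout.
        self.groups = [groups[key] for key in sorted(groups)]
        # Default: roughly one pass over the data per epoch.
        self.num_batches = num_batches or max(1, len(labels) // len(self.groups))
        self.rng = random.Random(seed)

    def __len__(self):
        return self.num_batches

    def __iter__(self):
        for _ in range(self.num_batches):
            # One random draw per group; batch size equals the number of groups.
            yield [self.rng.choice(members) for members in self.groups]


# Toy example: 3 classes x 4 clusters, 60 samples (made-up values).
labels = [i // 20 for i in range(60)]
cluster_ids = [i % 4 for i in range(60)]
sampler = ClusterBalancedBatchSampler(labels, cluster_ids, seed=0)
batches = list(sampler)
```

With 10 classes and 10 clusters per class, each batch would contain 100 indices. Because k-means clusters are usually unequal in size, this version samples with replacement across batches instead of trying to visit every sample exactly once.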

abhinavagarwalla · September 16, 2020, 3:45pm

Hi @nrupatunga
I am trying to come up with an approach to sample according to some clustering.
May I ask if you had any success implementing a custom loader?