Sample Batches according to IDs

SaadMunir · May 15, 2021, 2:57pm

Hello everyone!
I have a large dataset from a casting machine and its process parameters. Each row is an observation with about 200 parameters and a class label of 0 or 1, where 0 means the part is NOT defective and 1 means the part was badly manufactured. We are trying to map these process parameters to the correct class, and right now I am using a feedforward network with 5 layers; the area under the ROC curve is 0.76.
My thought is that we may be losing temporal information in our batches, which are currently created with a stratified train/test split and later balanced across both classes with a WeightedRandomSampler.
So my idea is to first sort the observations in ascending order of time, not shuffle them, and then create the batches so that the temporal order is preserved. However, when I take this approach, my mini-batches become severely imbalanced. Nevertheless, there might be some mini-batches where the minority class has a clear majority, for example 8 out of 10 samples in a mini-batch belong to the minority class. Let's say the IDs of such mini-batches are from 40 to 60. I would like the first iteration of my epoch to take a normal, highly unbalanced mini-batch as input, the second iteration to take a minority-class-dominant mini-batch (i.e. a batch with an ID between 40 and 60), the third iteration to take a normal, unbalanced batch again, the fourth to sample a minority-class-dominant batch again, and so on for a fixed number of iterations. In this way, temporal information would be preserved. I think that to implement this strategy I would have to create a custom generator or DataLoader in PyTorch, but I have little idea how to do that. Could anybody please guide me on this?

ptrblck · May 16, 2021, 6:04am

Based on your description, I think the proper way to implement this sampling strategy would be a custom sampler. By default, the DataLoader uses a SequentialSampler or a RandomSampler (depending on the shuffle argument), which you could replace with your custom class. This custom sampler could then take the targets and create the batch indices using your sampling logic.
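As a rough illustration of what "create the batch indices from the targets" could mean here, the sketch below cuts time-sorted labels into contiguous batches and separates the ones dominated by the minority class. The function name `split_batches`, the `threshold` parameter, and the toy labels are assumptions made for this example and are not part of any PyTorch API.

```python
def split_batches(targets, batch_size, threshold=0.5):
    """Cut time-ordered 0/1 targets into contiguous batches of indices.

    Returns (normal, minority_dominant), two lists of index lists.
    A batch counts as minority dominant when the fraction of label 1
    in it is at least `threshold` (a hypothetical cut-off).
    """
    normal, minority_dominant = [], []
    for start in range(0, len(targets), batch_size):
        # Contiguous indices keep the temporal order inside the batch.
        idx = list(range(start, min(start + batch_size, len(targets))))
        frac = sum(targets[i] for i in idx) / len(idx)
        (minority_dominant if frac >= threshold else normal).append(idx)
    return normal, minority_dominant


# Toy labels, already sorted by time: defects (1) cluster in the middle.
targets = [0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1]
normal, minority = split_batches(targets, batch_size=4)
print(normal)
print(minority)
```

If the labels live in a tensor, converting them first with `targets.tolist()` keeps the arithmetic in plain Python.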

SaadMunir · May 16, 2021, 1:08pm

Thank you very much for the idea! Could you please give me an idea of what my custom sampler class should look like?

ptrblck · May 16, 2021, 9:06pm

You could take a look at the sampler implementations and adapt one to your use case. In particular, I think you might want to pass the target tensors to the sampler and create the desired sample indices in its __iter__ method.
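A minimal sketch of such a class, under stated assumptions: the name `AlternatingBatchSampler`, the `threshold` cut-off, and the `seed` argument are hypothetical, and the targets are assumed to be plain 0/1 labels already sorted by time. It is written without a PyTorch import to keep it dependency-free; subclassing `torch.utils.data.Sampler` would be the more conventional choice.

```python
import random


class AlternatingBatchSampler:
    """Yields lists of indices: normal batch, minority batch, normal, ...

    Hypothetical sketch. Samples inside each batch stay contiguous in
    time; only the order in which batches are visited changes.
    """

    def __init__(self, targets, batch_size, num_iterations,
                 threshold=0.5, seed=0):
        self.num_iterations = num_iterations
        self.rng = random.Random(seed)
        self.normal, self.minority = [], []
        targets = list(targets)  # 0/1 labels, sorted by time
        for start in range(0, len(targets), batch_size):
            idx = list(range(start, min(start + batch_size, len(targets))))
            frac = sum(targets[i] for i in idx) / len(idx)
            (self.minority if frac >= threshold else self.normal).append(idx)
        if not self.normal or not self.minority:
            raise ValueError("need at least one batch of each kind")

    def __iter__(self):
        for step in range(self.num_iterations):
            # Even steps draw a normal batch, odd steps a minority one.
            pool = self.normal if step % 2 == 0 else self.minority
            yield self.rng.choice(pool)

    def __len__(self):
        return self.num_iterations


targets = [0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1]
sampler = AlternatingBatchSampler(targets, batch_size=4, num_iterations=4)
batches = list(sampler)
print(batches)
```

An object like this could then be passed as `DataLoader(dataset, batch_sampler=sampler)`, since `batch_sampler` accepts any iterable that yields lists of indices. Note that batches are drawn with replacement here, so the few minority-dominant batches will be repeated often.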