Sample Batches according to IDs

Hello everyone!
I have a large dataset from a casting machine and its process parameters. Each row is an observation with about 200 parameters and a class label of 0 or 1, where 0 means the part is not defective and 1 means the part was badly manufactured. We are trying to map these process parameters to the correct class. Right now I am using a feedforward network with 5 layers, and the area under the ROC curve is 0.76.
My suspicion is that our batches lack temporal information: they are currently created from a stratified train/test split and then balanced across both classes with a WeightedRandomSampler.
So my idea is to first sort the observations in ascending order of time and then create the batches without shuffling, so that the temporal order is preserved. However, when I take this approach, my mini-batches become severely imbalanced. Still, there are some mini-batches where the minority class has a clear majority, for example 8 out of 10 samples belonging to the minority class. Let's say the IDs of such mini-batches run from 40 to 60.
I would like the first iteration of my epoch to take a normal, highly imbalanced mini-batch as input, the second iteration to take a minority-dominant mini-batch (i.e. a batch with an ID between 40 and 60), the third iteration to take a normal, imbalanced batch again, the fourth iteration to take another minority-dominant batch, and so on for a fixed number of iterations. In this way, temporal information would be preserved.
I think that to implement this strategy I would have to create a custom generator or DataLoader in PyTorch, but I have little idea how to do that. Could anybody please guide me on this?

Based on your description, I think the proper way to implement this custom sampling strategy would be a custom sampler. By default the DataLoader uses a SequentialSampler or a RandomSampler (depending on the shuffle argument), which you could replace with your custom class. This custom sampler could then take the targets and create the batch indices using your sampling logic.

Thank you very much for the idea! Could you please give me an idea of what my custom sampler class should look like?

You could take a look at the sampler implementations and adapt one to your use case. In particular I think you might want to pass the target tensors to the sampler and create the desired sample indices in the __iter__ method.
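
As a rough illustration of that suggestion, here is a minimal sketch of a batch-level sampler. It assumes the targets are already sorted by time and use 1 for the minority class. The class name `AlternatingBatchSampler`, the `min_fraction` threshold, and the choice to cycle through each group of batches in temporal order are all assumptions made for this example, not something taken from the PyTorch sampler implementations. It is written in plain Python, since `DataLoader` accepts any iterable of index lists as its `batch_sampler` argument.

```python
from itertools import cycle


class AlternatingBatchSampler:
    """Yield time-ordered batches, alternating normal and minority-dominant ones.

    Assumes `targets` is sorted by time and holds 0/1 labels (1 = minority).
    All names and defaults here are illustrative, not a PyTorch API.
    """

    def __init__(self, targets, batch_size, num_iterations, min_fraction=0.5):
        labels = [int(t) for t in targets]  # works for lists and 1-D tensors
        self.num_iterations = num_iterations
        self.normal, self.minority = [], []
        # Cut the time-ordered indices into contiguous, non-overlapping batches
        # and sort each batch into one of the two groups by its minority share.
        for start in range(0, len(labels), batch_size):
            idx = list(range(start, min(start + batch_size, len(labels))))
            frac = sum(labels[i] for i in idx) / len(idx)
            (self.minority if frac >= min_fraction else self.normal).append(idx)
        if not self.normal or not self.minority:
            raise ValueError("need at least one batch of each kind")

    def __iter__(self):
        normal, minority = cycle(self.normal), cycle(self.minority)
        for step in range(self.num_iterations):
            # Even steps take a normal batch, odd steps a minority-dominant one.
            yield next(normal if step % 2 == 0 else minority)

    def __len__(self):
        return self.num_iterations


# Toy labels, already in temporal order.
targets = [0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1]
sampler = AlternatingBatchSampler(targets, batch_size=2, num_iterations=4)
batches = list(sampler)
```

With a real dataset you would pass it as `DataLoader(dataset, batch_sampler=sampler)` and leave `batch_size`, `shuffle`, `sampler`, and `drop_last` at their defaults, since those options cannot be combined with `batch_sampler`. Samples inside each batch stay in temporal order, while the minority-dominant batches are revisited more often than the normal ones.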
