I’m working on a project where it’s advantageous to use a sampler that is fed with a list of indices (as is the case with SubsetRandomSampler()). However, WeightedRandomSampler() draws from range(len(weights)) instead, so my DataLoader ends up trying to load indices I don’t want.
I.e. I want indices = [0, 2, 5, 6] and I build my weights using the infamous make_weights_for_balanced_classes method, but len(weights) prompts the loader to look for samples 0, 1, 2, 3.
Question: Can I achieve what I need without messing about with the source? If so, any hints are greatly appreciated. Hope all’s clear.
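To make the mismatch concrete, here’s a minimal repro (uniform placeholder weights, just for illustration):

```python
import torch
from torch.utils.data import WeightedRandomSampler

indices = [0, 2, 5, 6]                       # the only samples I want
weights = torch.ones(len(indices)).double()  # placeholder per-sample weights

# WeightedRandomSampler draws from range(len(weights)) == {0, 1, 2, 3},
# not from `indices`, so the DataLoader can also request samples 1 and 3,
# which I never asked for.
sampler = WeightedRandomSampler(weights, num_samples=len(weights))
print(list(sampler))  # e.g. [3, 1, 1, 0]
```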
Cheers Piotr, the only issue is that the data for the other indices simply don’t exist: my Dataset’s __getitem__ looks samples up by index, so my samples folder only contains files for the indices I pass.
If you think that any solution here would inevitably be papering over cracks and that it’s better to redesign my pipeline, do let me know and I’ll go back to the drawing board; otherwise, any suggestions are welcome.
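For context, my __getitem__ roughly does this (heavily simplified; the class name and the samples/*.pt layout are just illustrative):

```python
import torch
from torch.utils.data import Dataset

class SparseFolderDataset(Dataset):
    """Simplified stand-in: files exist only for the indices I pass,
    e.g. samples/0.pt, samples/2.pt, samples/5.pt, samples/6.pt."""

    def __init__(self, indices):
        self.indices = indices

    def __len__(self):
        return len(self.indices)

    def __getitem__(self, idx):
        # Looks the sample up by its real id on disk, so a sampler that
        # yields 1 or 3 would fail with a FileNotFoundError here.
        return torch.load(f"samples/{idx}.pt")
```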
I think your lazy loading approach is fine to use in your actual training pipeline.
However, you would need to grab all target values once to set up the sampler.
If you’ve stored the weights, you could then just load them in your actual training run and stick to your Dataset implementation. Let me know if this would work.
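Something like this minimal sketch might work (the inverse-class-frequency weighting stands in for make_weights_for_balanced_classes, and the hardcoded targets and filename are placeholders):

```python
import torch
from torch.utils.data import WeightedRandomSampler

# One-off pass: grab every target once,
# e.g. targets = torch.tensor([dataset[i][1] for i in indices]).
# Hardcoded here as a placeholder.
targets = torch.tensor([0, 0, 1, 2])

# Inverse class frequency, same idea as make_weights_for_balanced_classes:
# samples from rarer classes get proportionally larger weights.
class_counts = torch.bincount(targets)
sample_weights = (1.0 / class_counts.double())[targets]

# Store the weights once...
torch.save(sample_weights, "sample_weights.pt")

# ...and in the actual training script just load them and build the sampler.
sample_weights = torch.load("sample_weights.pt")
sampler = WeightedRandomSampler(sample_weights, num_samples=len(sample_weights))
```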