Disregard some images from the dataloader

Suppose I have a train dataloader that yields batches of images for training. After some time of training, I want to discard a certain group of images and have the dataloader only provide the images that I consider worth training on.

How can I realize this with a Sampler or Loader? Ideally, I would be able to change the set of relevant images dynamically and efficiently.

Thanks a lot!

A very quick way could be to use WeightedRandomSampler with a very small / zero weight for the samples you don’t want.
Note that things might break if you change the epoch length while other tools expect epochs of fixed length (len(dl)).
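A rough sketch of what I have in mind (the dataset and the discarded indices are made up for illustration):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

# stand-in dataset: 1000 random "images" with labels
images = torch.randn(1000, 3, 32, 32)
labels = torch.randint(0, 10, (1000,))
dataset = TensorDataset(images, labels)

# weight 1 for samples to keep, 0 for samples to discard
weights = torch.ones(len(dataset), dtype=torch.double)
unwanted = torch.tensor([3, 7, 42])  # made-up indices to discard
weights[unwanted] = 0.0

# num_samples keeps len(loader) constant even as the weights change
sampler = WeightedRandomSampler(weights, num_samples=len(dataset), replacement=True)
loader = DataLoader(dataset, batch_size=32, sampler=sampler)

# between epochs you can zero out more weights in place; the sampler
# draws from sampler.weights anew each epoch
sampler.weights[torch.tensor([100, 200])] = 0.0
```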

Best regards

Thomas

Thanks for the quick answer. WeightedRandomSampler sounds promising; however, I plan on changing the set of used samples quite often.

So I think it makes more sense to keep some kind of boolean mask tensor within the dataloader and modify that instead of the sampler.

Last I looked, the dataloader would call the sampler at the beginning of an epoch. Given that instantiating a dataloader isn’t that much more expensive than starting an epoch (i.e. instantiating the iterator), I’m not sure that what you have in mind is easy to implement or much more efficient.

The obvious other solution is to forgo the dataloader-level filtering and produce a dataset (possibly an IterableDataset) that accomplishes this (in a well-optimized pipeline, having multiple workers might not be as crucial as it initially seems).
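A minimal sketch of that idea, which would also cover your boolean-mask plan (the FilteredDataset class and set_active helper are made-up names):

```python
import torch
from torch.utils.data import Dataset

class FilteredDataset(Dataset):
    """Exposes only the samples of a base dataset that are currently active."""

    def __init__(self, base):
        self.base = base
        self.active = torch.arange(len(base))  # start with everything active

    def set_active(self, keep_mask):
        # keep_mask: boolean tensor of length len(self.base)
        self.active = keep_mask.nonzero(as_tuple=True)[0]

    def __len__(self):
        return len(self.active)

    def __getitem__(self, idx):
        return self.base[self.active[idx].item()]

# after calling set_active, recreate the DataLoader so that len(loader)
# and any worker copies of the dataset see the new subset
```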

That said, while I do believe that using the sampler is likely a good idea, it’s totally possible that you’ll have a much better solution matching your workflow. And if that’s the case, I’d be really keen to hear how you did it.

Best regards

Thomas


Dear Thomas, thanks again for the quick reply, really appreciate it.

In the meantime, I found this post. Apparently I can simply change the dataset.

The torchvision datasets seem to contain the attribute self.data, which I will try to back up in a separate variable and overwrite with a subset of the original data. Hopefully this will lead to efficient training on a smaller training set.

Will get back to you once I see how this goes; in case you have any concerns, please let me know!
Best,
Max

Yeah, changing the dataset is a good, if a bit hacky, way to fix things.
One thing to keep in mind is that the change won’t propagate to the workers of a dataloader until they are restarted.

Not sure if this is true, though; I see two posts claiming that this will work even with multiple workers, see this answer or this one for example.

In any case, I will verify this! Thanks for the advice.

Note that the linked posts mention that the Dataset manipulation has to be done after each epoch and without using persistent workers, so @tom’s warning fits into this.


@ptrblck Yes I noticed that, thanks for pointing out again.

@tom In case you’re still interested, I basically just used this solution and modified it such that I simply create a backup copy of the data and target tensors, which I can modify and revert to at any time.
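For reference, it looks roughly like this (a simplified sketch assuming MNIST, where both data and targets are stored as tensors; the helper names are my own):

```python
import torchvision

ds = torchvision.datasets.MNIST(root="data", train=True, download=True)

# untouched copies of the full tensors (MNIST stores torch tensors;
# other torchvision datasets may use numpy arrays or lists instead)
data_backup = ds.data.clone()
targets_backup = ds.targets.clone()

def use_subset(indices):
    # indices: 1-D LongTensor into the original training set
    ds.data = data_backup[indices]
    ds.targets = targets_backup[indices]

def revert():
    ds.data = data_backup
    ds.targets = targets_backup

# as noted above: do this between epochs and without persistent workers,
# or recreate the DataLoader so the workers pick up the new tensors
```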

Thanks for your help!
