I am using
__getitem__ to grab multiple embeddings from the file system and concatenate them into a single sample, which is then sent to my dataloader for batching.
The dataset is not very well sanitised and occasionally I will need to skip a sample. Is there a way to skip a sample and move on, or to call
__getitem__ again from within the method itself?
Alternatively, should this code be put somewhere else?
Thanks for any help.
Could you provide a bit more information about your use case?
Why can't you identify the indices that correspond to corrupted files, for instance during dataset instantiation by looping once over all the files, or, even more simply, identify those files before instantiating the dataset?
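A minimal sketch of that one-time scan, done before (or during) dataset instantiation. The directory layout and the `is_valid` check are placeholders for your own setup:

```python
import os

def find_valid_files(root, is_valid):
    """Scan a directory once and return only the paths that pass a
    user-supplied validity check. The resulting list can then be handed
    to the Dataset's __init__, so __getitem__ never sees a bad file."""
    valid = []
    for name in sorted(os.listdir(root)):
        path = os.path.join(root, name)
        if is_valid(path):  # e.g. non-empty, loadable, checksum OK...
            valid.append(path)
    return valid
```

For example, `find_valid_files(root, lambda p: os.path.getsize(p) > 0)` would drop empty files.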
In your training loop, I assume you're doing something like:
    for batch in dataloader:
        inputs, targets = batch
        outputs = model(inputs)
Can't you detect the batches to skip inside this loop?
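One way to sketch that in-loop skipping: wrap the epoch in a function that takes a batch-validity predicate, so bad batches are simply passed over. `is_bad_batch`, `model`, and the loader here are placeholders, not anything from the original post:

```python
def run_epoch(dataloader, model, is_bad_batch):
    """Iterate over the loader, skipping any batch flagged as bad
    (e.g. one that contains NaNs produced by a corrupt file).
    Returns the number of batches actually processed."""
    processed = 0
    for inputs, targets in dataloader:
        if is_bad_batch(inputs, targets):
            continue  # skip this batch entirely
        model(inputs)  # placeholder for the forward/backward pass
        processed += 1
    return processed
```

The downside of this approach is that a whole batch is discarded for one bad sample, which is why the sampler route below is usually preferable.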
If for some reason you can't do any of these, you have the possibility of defining a Sampler object that will be used by a DataLoader object to sample indices. Those indices will then be used to call the
__getitem__ method of a Dataset object. Check the docs of torch.utils.data for more info. Perhaps in the
__iter__ method of your sampler you could try to detect corrupted files and avoid yielding their indices, so that the
__getitem__ method will never be called with them.
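A minimal sketch of such a sampler, assuming the bad indices are known up front. In real code this would subclass `torch.utils.data.Sampler` and be passed to the `DataLoader` via its `sampler=` argument; the class name and `bad_indices` argument are placeholders:

```python
class SkipCorruptSampler:
    """Yields only the indices whose files passed a validity check,
    so the Dataset's __getitem__ is never called with a bad index.
    In practice, subclass torch.utils.data.Sampler and pass an
    instance as DataLoader(..., sampler=...)."""

    def __init__(self, data_source_len, bad_indices):
        bad = set(bad_indices)
        self.indices = [i for i in range(data_source_len) if i not in bad]

    def __iter__(self):
        return iter(self.indices)

    def __len__(self):
        return len(self.indices)
```

For shuffling, you could `random.shuffle` a copy of `self.indices` inside `__iter__` instead of yielding them in order.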
I hope I’m being clear, feel free to ask for more details of course.
Thank you @the-dharma-bum for your reply.
It seems that a
custom sampler is exactly what I require in this case. This will allow me to check that all samples in the
[idx] folder have been processed correctly before the index is sent to
__getitem__.
This is a great solution for me, as the dataset is still being processed while I am experimenting with some model architectures. Identifying incomplete samples during
__init__ wouldn't be possible due to the dataset size.
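For that use case, the per-folder check can be done lazily inside the sampler's `__iter__`, one index at a time, so the whole dataset is never scanned up front. The one-folder-per-index layout and the completeness test (`n_embeddings` files expected per folder) are assumptions, not details from the original post:

```python
import os

class LazyFolderSampler:
    """Yield only indices whose [idx] folder looks complete.

    Assumed layout: one folder per index under `root`, each expected to
    contain at least `n_embeddings` files. Replace _is_complete with
    your own check. In real code this would subclass
    torch.utils.data.Sampler."""

    def __init__(self, root, n_indices, n_embeddings):
        self.root = root
        self.n_indices = n_indices
        self.n_embeddings = n_embeddings

    def _is_complete(self, idx):
        folder = os.path.join(self.root, str(idx))
        return os.path.isdir(folder) and len(os.listdir(folder)) >= self.n_embeddings

    def __iter__(self):
        for idx in range(self.n_indices):
            if self._is_complete(idx):  # checked lazily, one folder at a time
                yield idx
```

Since the check runs on every epoch, samples that finish processing mid-experiment are picked up automatically on the next pass.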
Thank you for your help!