Iter __getitem__ from within self.__getitem__

ed-fish · May 20, 2021, 3:07pm

I am using __getitem__ to grab multiple embeddings from a file system and concat them into a single sample which is then sent to my dataloader for batching.

The dataset is not very well sanitised and occasionally I will need to skip a sample. Is there a way to return and skip a sample, or iterate __getitem__ from within the method itself?

Alternatively - should this code be put somewhere else?

Thanks for any help.

Ed

the-dharma-bum · May 20, 2021, 3:32pm

Could you provide a bit more information about your use case?

Why can’t you identify the indices that correspond to corrupted files, for instance during dataset instantiation by looping once over all the files, or even more simply, identify those files before the dataset is instantiated?

In your training loop, I assume you’re doing something like
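To make the "loop once over all the files" idea concrete, here is a minimal sketch. It assumes a hypothetical layout where each sample lives in a folder named after its index (root/0, root/1, and so on) and counts as complete when every file listed in required_files exists. The names find_complete_indices, FilteredEmbeddingDataset and required_files are illustrative only, and in real code the class would normally subclass torch.utils.data.Dataset and load the embeddings instead of returning paths.

```python
import os


def find_complete_indices(root, num_samples, required_files):
    """Scan once and return the indices whose folder holds every required file."""
    complete = []
    for idx in range(num_samples):
        folder = os.path.join(root, str(idx))
        if all(os.path.isfile(os.path.join(folder, f)) for f in required_files):
            complete.append(idx)
    return complete


class FilteredEmbeddingDataset:
    """Map-style dataset (hypothetical) that only exposes pre-validated indices."""

    def __init__(self, root, num_samples, required_files):
        self.root = root
        self.required_files = list(required_files)
        # One pass over the file system at construction time.
        self.valid = find_complete_indices(root, num_samples, self.required_files)

    def __len__(self):
        return len(self.valid)

    def __getitem__(self, i):
        idx = self.valid[i]  # position in the filtered list -> real sample index
        folder = os.path.join(self.root, str(idx))
        # Placeholder: returns file paths; real code would load and
        # concatenate the embeddings here.
        return [os.path.join(folder, f) for f in self.required_files]
```

With this approach __getitem__ never sees an incomplete sample, at the cost of one full scan up front.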

for batch in dataloader:
    inputs, targets = batch
    outputs = model(inputs)
    ...

Can’t you detect batches to skip in this loop?

If for some reason none of these suggestions work, you can define a Sampler object that the DataLoader will use to sample indices. Those indices are then used to call the __getitem__ method of the Dataset object. Check the docs for torch.utils.data for more info. Perhaps in the __iter__ method of your sampler you could try to detect corrupted files and avoid yielding their indices, so that the __getitem__ method is never called for them.
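A rough sketch of such a sampler follows, under the same kind of assumptions: a hypothetical folder-per-index layout and a hypothetical required_files list that defines what "complete" means. It is written as a plain Python iterable, since DataLoader accepts any iterable with __len__ as its sampler; subclassing torch.utils.data.Sampler would work the same way. The class name and its parameters are made up for illustration.

```python
import os
import random


class SkipIncompleteSampler:
    """Yields only indices whose sample folder currently looks complete."""

    def __init__(self, root, num_samples, required_files, shuffle=True, seed=None):
        self.root = root
        self.num_samples = num_samples
        self.required_files = list(required_files)
        self.shuffle = shuffle
        self.rng = random.Random(seed)

    def is_complete(self, idx):
        # Assumed check: every required file exists in <root>/<idx>/.
        folder = os.path.join(self.root, str(idx))
        return all(
            os.path.isfile(os.path.join(folder, name))
            for name in self.required_files
        )

    def __iter__(self):
        indices = list(range(self.num_samples))
        if self.shuffle:
            self.rng.shuffle(indices)
        for idx in indices:
            # Checked lazily at sampling time, so folders that finish
            # processing later can be picked up in later epochs.
            if self.is_complete(idx):
                yield idx

    def __len__(self):
        # Upper bound only: fewer indices may actually be yielded.
        return self.num_samples
```

An instance could then be passed to the DataLoader through its sampler argument, leaving shuffle at its default of False, because DataLoader does not allow shuffle=True together with a custom sampler. Since indices are skipped during iteration, len(dataloader) would only be an upper bound on the number of batches.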

I hope I’m being clear, feel free to ask for more details of course.

ed-fish · May 21, 2021, 10:42am

Thank you @the-dharma-bum for your reply.

It seems that a custom sampler is exactly what I require in this case. This will allow me to check that all samples in the [idx] folder have been processed correctly before sending to the dataloader.

This is a great solution for me as the dataset is being processed while I am experimenting with some model architectures. Identifying incomplete samples during init wouldn’t be possible due to the dataset size.

Thank you for your help!
