Custom Dataset w/ Specific Requirements

For a research project, I need to load a dataset of images along with numerical labels for drive and motor data. Each image is also labeled with a flag, and at runtime I need to be able to choose which image flags to ignore and which to keep. The current data loading system was written and optimized for Caffe and stores the files in hdf5 format. During training, extracting and converting the hdf5 data is the main bottleneck. Because of the ignore-flag feature, I was not able to use the default PyTorch Dataset to dynamically load the data, since each call to `__getitem__` must return an item. I am looking to convert our current dataset into a format that is optimized for loading into PyTorch and that also supports loading a frame and skipping it if its ignore flag is set. My current code for importing the dataset and using it for training is available here:

### Loading hdf5 Dataset w/ Ignore List
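For reference, a minimal sketch of what a loader like this might look like, assuming `h5py` and hypothetical dataset keys `images`, `drive`, `motor`, and `flags` (the real layout may differ):

```python
import h5py
import numpy as np

# Hypothetical sketch: the keys "images", "drive", "motor", and "flags"
# are assumptions for illustration, not taken from the original code.
def load_hdf5(path, ignore_flags):
    with h5py.File(path, "r") as f:
        images = f["images"][:]   # N x H x W x C image array
        drive = f["drive"][:]     # numerical drive labels
        motor = f["motor"][:]     # numerical motor labels
        flags = f["flags"][:]     # per-frame flag values
    # Drop every frame whose flag is in the runtime ignore list
    keep = ~np.isin(flags, list(ignore_flags))
    return images[keep], drive[keep], motor[keep], flags[keep]
```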

### Use of Dataset in Training Code
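And a rough sketch of the training-side usage, with dummy tensors and a placeholder model standing in for the real setup:

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical stand-ins: dummy tensors replace the real hdf5-backed data,
# and the model/optimizer are placeholders.
images = torch.randn(100, 3, 32, 32)
targets = torch.randn(100, 2)  # e.g. drive and motor values
loader = DataLoader(TensorDataset(images, targets), batch_size=16, shuffle=True)

model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 2))
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

for epoch in range(10):
    for x, y in loader:
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()
```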

What would be the best format to use for my particular use case? Are there any examples of setting up a similar type of dataset?

In general, I write my own Dataset class that inherits from the PyTorch Dataset, and it handles all the logic of what data and labels to feed to the network, and when. Then the PyTorch DataLoader doesn't have to know about any of that; it just loads pairs.
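As a generic sketch of that pattern (a hypothetical `FilteredDataset`; it assumes the flags are available up front so the valid indices can be computed once, which may or may not match your hdf5 setup):

```python
from torch.utils.data import Dataset

class FilteredDataset(Dataset):
    # Hypothetical sketch: flags are assumed to be known up front, so the
    # valid indices can be computed once and indexed into directly.
    def __init__(self, images, labels, flags, ignore_flags):
        self.images = images
        self.labels = labels
        # Keep only the indices whose flag is not in the ignore set
        self.keep = [i for i, f in enumerate(flags) if f not in set(ignore_flags)]

    def __len__(self):
        return len(self.keep)

    def __getitem__(self, idx):
        i = self.keep[idx]
        return self.images[i], self.labels[i]
```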

Disclaimer: I didn’t read the code, so I’m not sure precisely what the problem is.

Sorry if I wasn't being clear. My specific problem is that the PyTorch Dataset class has a `__getitem__` function that requires an index. I don't know whether the data at a certain index will be used until I load it and check its flag to see if I should ignore it. Each time `__getitem__` is called I need to return something, so this hasn't been working for me. Is there a way I can tell the PyTorch Dataset class to skip a particular index after a call to `__getitem__`?

It only takes an index so that the DataLoader can load a certain number of images during training (e.g. one epoch's worth of images). `__getitem__` could essentially ignore the index and iteratively load data, check the ignore flag, and only return the data if ignore is False. That's probably better than skipping the index, because you'll actually go through the same number of data points each time you call the DataLoader.
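A minimal sketch of that idea (the `frames` sequence, its `(image, label, flag)` layout, and the `epoch_size` parameter are assumptions for illustration, not part of the original code):

```python
from torch.utils.data import Dataset

class SkippingDataset(Dataset):
    # Hypothetical sketch: `frames` is assumed to be an indexable sequence of
    # (image, label, flag) triples; the flag is only inspected after loading.
    def __init__(self, frames, ignore_flags, epoch_size):
        self.frames = frames
        self.ignore_flags = set(ignore_flags)
        self.epoch_size = epoch_size
        self.cursor = 0  # position of the next frame to try

    def __len__(self):
        # Fixed length: the DataLoader will draw exactly this many samples
        return self.epoch_size

    def __getitem__(self, index):
        # Ignore `index`; scan forward (wrapping around) until a frame passes
        # the filter. Assumes at least one frame is not ignored.
        while True:
            image, label, flag = self.frames[self.cursor]
            self.cursor = (self.cursor + 1) % len(self.frames)
            if flag not in self.ignore_flags:
                return image, label
```

One caveat: with `num_workers > 0`, each worker process gets its own copy of the dataset, and therefore its own cursor, so the scan order will differ from the single-process case.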

How would the DataLoader know when I'm out of data? If the index is ignored and I just iteratively load data until it runs out, how can I signal to the DataLoader that I am done going through the dataset? Is it possible for `__getitem__` to return None and for the DataLoader to ignore that index? I have my dataset indexed; it's just the ignore list that needs to be applied dynamically.

The DataLoader samples data points until it has drawn `len(dataset)` samples. So you could just set the length of your dataset to a fixed number (by overriding the `__len__` method).
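For example, continuing the `SkippingDataset` sketch above with dummy frames (any frame whose flag equals 1 is skipped, and `epoch_size` fixes what one "epoch" means):

```python
import torch
from torch.utils.data import DataLoader

# Dummy frames for illustration: (image, label, flag) triples, flags cycle 0, 1, 2
frames = [(torch.randn(3, 32, 32), torch.tensor([0.1, 0.2]), i % 3)
          for i in range(100)]

# epoch_size fixes len(dataset), so every "epoch" yields exactly 200 samples
dataset = SkippingDataset(frames, ignore_flags={1}, epoch_size=200)
loader = DataLoader(dataset, batch_size=32)

for epoch in range(3):
    for images, labels in loader:
        pass  # training step goes here
    # per-epoch saving/validation code runs here, exactly as before
```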

With this approach, could I continue to have epoch training behavior? Currently I have some saving and validation code that should run after each epoch of data is shown.