Request help on data loader

That should be fairly straightforward to do as a Dataset; I doubt you'd need to write a custom loader.

Create a dataset that initializes a Pandas DataFrame (if you've got a CSV file defining your mapping/splits) or a good ol' Python dict that maps samples to numpy filenames and sample indices for each split, and pass that mapping in to the Dataset constructor.
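For example, a rough sketch of building that mapping from a hypothetical CSV with split / filename / sample_offset columns (the file name and column names here are just assumptions):

```python
import pandas as pd

# hypothetical splits.csv with columns: split, filename, sample_offset
df = pd.read_csv("splits.csv")
train_mapping = [
    (row.filename, row.sample_offset)
    for row in df[df.split == "train"].itertuples()
]
# e.g. [("part_000.npy", 0), ("part_000.npy", 1), ("part_001.npy", 0), ...]
```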

For the __len__ method, return the length of your mapping.

For __getitem__, map the index to a (filename, sample_offset) pair (the sample's index within its numpy file), load that sample from the numpy file, and return it.
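Putting those pieces together, something like this (a minimal sketch, assuming each .npy file holds one array whose first axis indexes samples and whose dtype torch supports; NumpyFileDataset is a made-up name):

```python
import numpy as np
import torch
from torch.utils.data import Dataset

class NumpyFileDataset(Dataset):
    def __init__(self, mapping):
        # mapping: list of (filename, sample_offset) pairs for one split
        self.mapping = mapping

    def __len__(self):
        return len(self.mapping)

    def __getitem__(self, index):
        filename, sample_offset = self.mapping[index]
        data = np.load(filename)   # naive: reads the whole file on every call
        return torch.from_numpy(data[sample_offset])
```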

Now, without knowing what your numpy files are like, there could be some serious performance issues with that, especially if they are compressed. Keeping it simple, you could experiment with loading the numpy files using memory mapping (possibly keeping them all mapped). You could also experiment with wrapping the load method that __getitem__ calls in @functools.lru_cache if those numpy files are compressed but not massive. If the files are both compressed and huge, you could use the caching in conjunction with a custom Sampler for train that returns 1…n batches from the same sample file before sampling from another file.
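A rough sketch of the cached load (load_array is a made-up name; note that mmap_mode only helps for uncompressed .npy files, so for compressed files the cache is what saves you from decompressing on every call):

```python
import functools

import numpy as np

@functools.lru_cache(maxsize=16)  # tune maxsize to how many files fit in RAM
def load_array(filename):
    # mmap_mode="r" pages uncompressed .npy data in from disk on access;
    # compressed files get fully decompressed, so caching the result avoids
    # paying that cost on every __getitem__ call
    return np.load(filename, mmap_mode="r")

# then in __getitem__:  data = load_array(filename)
```

And an equally rough sketch of the file-grouped Sampler idea: it still shuffles within and across files, but keeps consecutive indices coming from the same file so the cached load stays warm:

```python
import random
from torch.utils.data import Sampler

class FileGroupedSampler(Sampler):
    def __init__(self, mapping):
        # group dataset indices by source file
        self.by_file = {}
        for idx, (filename, _) in enumerate(mapping):
            self.by_file.setdefault(filename, []).append(idx)

    def __iter__(self):
        files = list(self.by_file)
        random.shuffle(files)          # visit files in random order
        for f in files:
            indices = self.by_file[f][:]
            random.shuffle(indices)    # shuffle samples within a file
            yield from indices

    def __len__(self):
        return sum(len(idxs) for idxs in self.by_file.values())
```

You'd pass it to the DataLoader as sampler=FileGroupedSampler(train_mapping) (instead of shuffle=True), so runs of batches are drawn from one file before moving on to the next.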