Request help on data loader

Hi!

I generated a data set that contains 10 numpy files, and each .npy file has 10000 samples. Are there any examples of loading this kind of data set for training?

The tutorial here http://pytorch.org/tutorials/beginner/data_loading_tutorial.html is very simple, but in my case the data is split across multiple files. How could I easily create something like FaceLandmarksDataset from the Dataset base class?

I also read this example https://github.com/pytorch/vision/blob/master/torchvision/datasets/cifar.py, but it seems all the batch data needs to be loaded into memory first, which might be impractical for large-scale data, right?

Thanks.

That should be fairly straightforward to do as a Dataset; I doubt you'd need to write a custom loader.

Create a dataset that initializes a Pandas DF (if you've got a csv file defining your mapping/splits) or a good ol' Python dict that maps each sample to a numpy filename and a sample index, one mapping per split, which you can pass in to the Dataset constructor.

For the __len__ method, return the length of your mapping.

For __getitem__, map index -> (filename, sample_offset), where the offset is the index within the numpy file, then load the sample from that numpy file and return it; see the sketch below.
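Something like this minimal sketch (names like file_paths and samples_per_file are just placeholders, and it assumes every file holds the same number of samples):

import numpy as np
from torch.utils.data import Dataset

class NpyDataset(Dataset):
    def __init__(self, file_paths, samples_per_file):
        # file_paths: list of .npy files belonging to one split
        self.file_paths = file_paths
        self.samples_per_file = samples_per_file

    def __len__(self):
        # total number of samples across all files in this split
        return len(self.file_paths) * self.samples_per_file

    def __getitem__(self, index):
        # map global index -> (file index, offset within that file)
        file_idx, offset = divmod(index, self.samples_per_file)
        # naive version: reloads the whole file on every access,
        # see the caching / memory-mapping notes below
        data = np.load(self.file_paths[file_idx])
        return data[offset]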

Now, without knowing what your numpy files are like, there could be some serious performance issues with that, especially if they are compressed. Keeping it simple, you could experiment with loading the numpy files using memory mapping (possibly keeping them all mapped). You could also experiment with wrapping the load method that __getitem__ calls with @functools.lru_cache if those numpy files are compressed but not massive. If the files are both compressed and huge, you could use the caching in conjunction with a custom Sampler for train that returns 1…n batches from the same sample file before sampling from another file.
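For example, the cached loader could be as simple as something like this (load_file is a hypothetical helper name, and the maxsize is just a placeholder you'd tune to your memory budget):

import functools
import numpy as np

@functools.lru_cache(maxsize=2)  # keep at most two files' worth of data in memory
def load_file(path):
    # swap in np.load(path, mmap_mode='r') to memory-map instead of reading eagerly
    return np.load(path)

Then __getitem__ would call load_file(self.file_paths[file_idx]) instead of np.load directly.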

Hi, @rwightman, thanks a lot for your answer and kind suggestions!

Yeah, the train split has ten numpy files, the validation split has two, and the test split has two.
Each numpy file holds a dictionary containing 10000 samples, with two keys: batch_x with size 10000x512x64 and batch_y with size 10000x256.

What you suggested above is a bit too high-level for me, especially the caching-related part.

In fact, my concern is about the batch inputs. If we set the batch size to 256, and the sample indices are randomly selected from 0~10*10000, then when the Dataset wants to generate this batch, it may need to read all ten numpy files to get the corresponding samples and fill the batch. When using a cache, the data from all ten numpy files will end up cached, so I think there is no difference between this kind of caching and reading all the data into memory just once (like the way it is done in https://github.com/pytorch/vision/blob/master/torchvision/datasets/cifar.py). Maybe I'm misunderstanding caching.

Okay, that’s pretty big, yes.

The caching idea was for a situation where memory constraints weren't the primary concern. And the custom sampler can help you match your random access pattern during training to the cache policy (how many files you can keep cached at the same time).
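To make that concrete, here's a rough, untested sketch of such a sampler; it shuffles the file order and then the sample order within each file, so consecutive batches mostly hit whichever file is already cached (it assumes the same index layout as the Dataset sketch above):

import random
from torch.utils.data import Sampler

class FileGroupedRandomSampler(Sampler):
    def __init__(self, num_files, samples_per_file):
        self.num_files = num_files
        self.samples_per_file = samples_per_file

    def __len__(self):
        return self.num_files * self.samples_per_file

    def __iter__(self):
        file_order = list(range(self.num_files))
        random.shuffle(file_order)  # visit files in a random order
        for file_idx in file_order:
            start = file_idx * self.samples_per_file
            indices = list(range(start, start + self.samples_per_file))
            random.shuffle(indices)  # shuffle samples within the file
            yield from indices

You'd pass an instance of it to the DataLoader via the sampler argument and leave shuffle at its default of False.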

Is it safe to assume your numpy files are .npy and not .npz? You could try the simplest approach and use the memory-map flag when loading your numpy files and see how the performance is. Depending on the combined size of all the numpy files, you may be able to load all the numpy file mappings once and leave them open.
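i.e., roughly this (made-up filename; as far as I know memory mapping only works for plain, uncompressed .npy arrays, not for pickled objects or .npz archives):

import numpy as np

# memory-map the file: the data stays on disk and only the slices you index are read
arr = np.load('train_part_0.npy', mmap_mode='r')
sample = arr[1234]  # pulls just this sample's data from disk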

Yeah, got it. Thanks a lot.