Proper way to create new dataset

Assume I want to create a new dataset. My dataset consists of N “feature tensors” and N “label tensors”.

How would I go about creating a dataset from these tensors? What is the proper way of storing it to disk, and then retrieving it?

I could save everything into a single file using torch.save, then subclass the Dataset class and implement __getitem__ by loading the whole dataset from disk and returning the requested element.
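A minimal sketch of that approach, with made-up shapes and a made-up file name (here everything is loaded once in __init__ rather than on every __getitem__ call, which is the usual way to write it):

```python
import os
import tempfile

import torch
from torch.utils.data import Dataset

# Made-up example data: N = 100 feature vectors of size 8, with binary labels.
features = torch.randn(100, 8)
labels = torch.randint(0, 2, (100,))

path = os.path.join(tempfile.gettempdir(), "dataset.pt")
torch.save({"features": features, "labels": labels}, path)


class InMemoryDataset(Dataset):
    def __init__(self, path):
        # The entire dataset is read into RAM up front.
        data = torch.load(path)
        self.features = data["features"]
        self.labels = data["labels"]

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        # Pure indexing, no disk access per sample.
        return self.features[idx], self.labels[idx]
```

This works well for datasets that comfortably fit in memory, but the whole file is loaded regardless of how many samples you actually use.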

For very large N, though, this could be problematic: it would be slow and occupy a lot of memory. So is there a better way, one that avoids loading everything into memory and also plays nicely with retrieving batches of random elements?

The Dataset API is very general. You can do whatever you want with it. If N is large, then each sample can be its own file, an entry in a database, or a file on a remote HDD.
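A sketch of the one-file-per-sample variant, assuming a made-up naming convention of `sample_<idx>.pt`: __getitem__ reads only the requested sample, and a DataLoader with shuffle=True takes care of drawing random batches, so memory use stays at one batch rather than the whole dataset.

```python
import os
import tempfile

import torch
from torch.utils.data import Dataset, DataLoader


class LazyFileDataset(Dataset):
    def __init__(self, sample_dir, n):
        self.sample_dir = sample_dir
        self.n = n

    def __len__(self):
        return self.n

    def __getitem__(self, idx):
        # Only this one sample is read from disk.
        item = torch.load(os.path.join(self.sample_dir, f"sample_{idx}.pt"))
        return item["feature"], item["label"]


# Write a handful of per-sample files for demonstration.
root = tempfile.mkdtemp()
for i in range(10):
    torch.save(
        {"feature": torch.randn(8), "label": torch.tensor(i % 2)},
        os.path.join(root, f"sample_{i}.pt"),
    )

# shuffle=True samples random indices; only those files are loaded per batch.
loader = DataLoader(LazyFileDataset(root, n=10), batch_size=4, shuffle=True)
xb, yb = next(iter(loader))
```

The same __getitem__ pattern applies if each sample lives in a database row or on remote storage; only the loading code inside it changes.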
