Proper way to create new dataset

Assume I want to create a new dataset. My dataset consists of N “feature tensors” and N “label tensors”.

How would I go about creating a dataset from these tensors? What is the proper way of storing it to disk, and then retrieving it?

I could save everything into a single file, and then subclass the Dataset class and implement __getitem__ by simply loading the whole dataset from disk and returning the correct element.
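For concreteness, the single-file approach could look roughly like the sketch below. It assumes `torch.save` / `torch.load` as the storage format, and the class name, file name, and toy data are made up for illustration. It also loads the file once in the constructor rather than on every `__getitem__` call, which is a small departure from the description above.

```python
import os
import tempfile

import torch
from torch.utils.data import Dataset


class SingleFileDataset(Dataset):
    """Hypothetical dataset that reads one saved file fully into memory."""

    def __init__(self, path):
        # The file holds a dict with two tensors (see the save call below).
        data = torch.load(path)
        self.features = data["features"]
        self.labels = data["labels"]

    def __len__(self):
        return self.features.shape[0]

    def __getitem__(self, idx):
        # Plain indexing into the in-memory tensors.
        return self.features[idx], self.labels[idx]


# Toy data standing in for the real tensors: N = 100 samples, 8 features each.
features = torch.randn(100, 8)
labels = torch.randint(0, 2, (100,))

# Store both tensors in a single file, then read them back through the dataset.
path = os.path.join(tempfile.mkdtemp(), "dataset.pt")
torch.save({"features": features, "labels": labels}, path)

dataset = SingleFileDataset(path)
x, y = dataset[3]
```

This keeps all N samples in memory at once, which is fine for small N only.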

For very large N, though, this could be problematic: it will be very slow and occupy a lot of memory. So is there a better way that avoids loading everything into memory, and that also plays nicely with retrieving batches of random elements?

The Dataset API is very general; you can do whatever you want with it. If N is large, each sample can be its own file, an entry in a database, or a file on a remote disk.
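As a rough sketch of the one-file-per-sample option, a dataset might read only the requested sample in `__getitem__`, so memory use does not grow with N. The class name, the `sample_{i}.pt` naming scheme, and the toy data are assumptions for illustration, not an established convention.

```python
import os
import tempfile

import torch
from torch.utils.data import DataLoader, Dataset


class PerSampleFileDataset(Dataset):
    """Hypothetical dataset where each sample lives in its own file."""

    def __init__(self, root, num_samples):
        self.root = root
        self.num_samples = num_samples

    def __len__(self):
        return self.num_samples

    def __getitem__(self, idx):
        # Only the requested sample is read from disk.
        sample = torch.load(os.path.join(self.root, f"sample_{idx}.pt"))
        return sample["feature"], sample["label"]


# Write toy samples to disk, one file per (feature, label) pair.
root = tempfile.mkdtemp()
num_samples = 20
for i in range(num_samples):
    sample = {"feature": torch.full((4,), float(i)), "label": torch.tensor(i % 2)}
    torch.save(sample, os.path.join(root, f"sample_{i}.pt"))

dataset = PerSampleFileDataset(root, num_samples)

# DataLoader draws random indices and stacks the samples into batches.
loader = DataLoader(dataset, batch_size=5, shuffle=True)
batch_features, batch_labels = next(iter(loader))
```

With `shuffle=True` the `DataLoader` handles the random batching, and setting `num_workers` above zero would let several processes read files in parallel.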
