I created a custom dataset in two ways:
- Created a single CSV file containing all my training/testing data, loaded it in `__init__`, and made `__getitem__` a simple index lookup.
- Created multiple small CSV files for the training/testing data. `__init__` is very simple: it just records the root folder name, the number of files in it, and so on. `__getitem__` is more complicated: given the index of the requested training example, I calculate which file contains that example and then load the required data from it.
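The second approach can be sketched roughly as below. This is a minimal, hypothetical illustration (the shard naming, the stdlib `csv` parsing, and the cumulative-count bookkeeping are my assumptions, not your actual code); it only defines `__len__`/`__getitem__`, so it plugs into a PyTorch-style `DataLoader` without importing `torch` here:

```python
import bisect
import csv
import os

class MultiFileCSVDataset:
    """PyTorch-style dataset over many small CSV shards in `root`.

    Assumes every row of every .csv file is one training example
    (hypothetical layout for illustration).
    """

    def __init__(self, root):
        self.files = sorted(
            os.path.join(root, f) for f in os.listdir(root) if f.endswith(".csv")
        )
        # Cumulative row counts let us map a global example index to
        # (which file, which row inside that file).
        self.cum_sizes = []
        total = 0
        for path in self.files:
            with open(path, newline="") as fh:
                total += sum(1 for _ in csv.reader(fh))
            self.cum_sizes.append(total)

    def __len__(self):
        return self.cum_sizes[-1] if self.cum_sizes else 0

    def __getitem__(self, idx):
        # Binary-search the cumulative counts to find the shard
        # holding the requested global index.
        file_idx = bisect.bisect_right(self.cum_sizes, idx)
        prev = self.cum_sizes[file_idx - 1] if file_idx > 0 else 0
        row_in_file = idx - prev
        with open(self.files[file_idx], newline="") as fh:
            for i, row in enumerate(csv.reader(fh)):
                if i == row_in_file:
                    return row
        raise IndexError(idx)
```

Note that each `__getitem__` call here reopens and rescans a shard file from the top, which is exactly the kind of per-example I/O overhead that can slow training down relative to the one-big-file-in-memory approach.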
The first way worked like a charm while the data was small. As the data grew, creating one big file with all training examples became very time consuming, and the memory requirements of the data-creation script grew so large that it would hang. Even once the big file existed, loading it into the program took a while, but `__getitem__` stayed very simple and fast.
With the second way I don't need to create a merged file at all, so there are no big files and no large memory requirements during data creation or loading. I thought this was the more efficient approach, but it is increasing training time.
What is the best way to handle this trade-off?