I created a custom dataset in two ways:
- Created a single CSV file containing all my training/testing data, loaded it in `__init__`, and made `__getitem__` a simple index lookup.
- Created multiple small CSV files for the training/testing data. `__init__` is very simple: it just records the root folder name, the number of files in it, and so on. `__getitem__` is more complicated: given the index of the requested training example, I calculate which file contains that example and then load the required data from it.
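The second approach can be sketched roughly as below. This is a minimal, hypothetical illustration (the shard naming, the stdlib `csv` parsing, and the cumulative-count bookkeeping are my assumptions, not your actual code); it only defines `__len__`/`__getitem__`, so it plugs into a PyTorch-style `DataLoader` without importing `torch` here:

```python
import bisect
import csv
import os

class MultiFileCSVDataset:
    """PyTorch-style dataset over many small CSV shards in `root`.

    Assumes every row of every .csv file is one training example
    (hypothetical layout for illustration).
    """

    def __init__(self, root):
        self.files = sorted(
            os.path.join(root, f) for f in os.listdir(root) if f.endswith(".csv")
        )
        # Cumulative row counts let us map a global example index to
        # (which file, which row inside that file).
        self.cum_sizes = []
        total = 0
        for path in self.files:
            with open(path, newline="") as fh:
                total += sum(1 for _ in csv.reader(fh))
            self.cum_sizes.append(total)

    def __len__(self):
        return self.cum_sizes[-1] if self.cum_sizes else 0

    def __getitem__(self, idx):
        # Binary-search the cumulative counts to find the shard
        # holding the requested global index.
        file_idx = bisect.bisect_right(self.cum_sizes, idx)
        prev = self.cum_sizes[file_idx - 1] if file_idx > 0 else 0
        row_in_file = idx - prev
        with open(self.files[file_idx], newline="") as fh:
            for i, row in enumerate(csv.reader(fh)):
                if i == row_in_file:
                    return row
        raise IndexError(idx)
```

Note that each `__getitem__` call here reopens and rescans a shard file from the top, which is exactly the kind of per-example I/O overhead that can slow training down relative to the one-big-file-in-memory approach.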
The first way worked like a charm while the data was small. As the data grew, creating one big file with all training examples became very time consuming, and the memory requirements of the data-creation script grew so large that it would hang. Even once the big file existed, loading it into the program took a while, but `__getitem__` stayed very simple and fast.
With the second way I don't need to create a merged file at all, so there are no big files and no large memory requirements during data creation or loading. I thought this was the more efficient approach, but it is increasing training time.
What is the best way to handle this trade-off?