Efficient ways to use Dataset and DataLoader

I created a custom dataset in two ways:

  1. Created a single CSV file with all my training/testing data, loaded it in __init__, and made __getitem__ a very simple index lookup.
  2. Created many small CSV files for the training/testing data. __init__ is very simple (it just stores the root folder name, the number of files, and so on), while __getitem__ is more involved: given the index of the requested training example, it works out which file contains that example and then loads the required data (see the sketch after this list).
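
For reference, here is a minimal sketch of what the second approach looks like. The file naming, chunk size, and column layout are assumptions for illustration, not my exact script:

```python
import os
import pandas as pd
import torch
from torch.utils.data import Dataset

class ChunkedCSVDataset(Dataset):
    """Reads training examples from many small CSV files under a root folder.

    Assumes every chunk file (part_000.csv, part_001.csv, ...) has a header row,
    holds the same number of data rows, and contains only numeric columns with
    the label in the last column, so a global index maps to (file, row) arithmetically.
    """

    def __init__(self, root_dir, rows_per_file):
        self.root_dir = root_dir
        self.rows_per_file = rows_per_file
        self.files = sorted(f for f in os.listdir(root_dir) if f.endswith(".csv"))

    def __len__(self):
        # Assumes all files are full; a last partial file would need special handling.
        return len(self.files) * self.rows_per_file

    def __getitem__(self, idx):
        # Map the global example index to a (file, row-within-file) pair.
        file_idx, row_idx = divmod(idx, self.rows_per_file)
        path = os.path.join(self.root_dir, self.files[file_idx])
        # skiprows/nrows lets pandas read just the one requested row
        # (header + preceding rows are skipped) instead of the whole chunk.
        row = pd.read_csv(path, skiprows=1 + row_idx, nrows=1, header=None)
        values = row.to_numpy()[0]
        x = torch.tensor(values[:-1], dtype=torch.float32)
        y = torch.tensor(values[-1], dtype=torch.float32)
        return x, y
```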

The first way worked like a charm while the data was small. Once the data grew, creating a single big file with all training examples became very time consuming in itself, and the memory requirements of the program grew so much that the data creation script would hang. Once the big file was created, loading it into the program still took a while, but __getitem__ stayed very simple.

With the second way, I don't need to create a merged file with all training examples, so there are no big files and no big memory requirements during data creation or data loading. I thought this was the more efficient approach, but it increases training time, since every __getitem__ call has to open and read a file.

What is the best way to handle this issue?

You can use a bigger num_workers in the DataLoader to set how many subprocesses are used for data loading.
pin_memory=True can also help if you're using a GPU.
Warning: this doesn't work perfectly on Windows machines, and you should read the DataLoader docs for more about platform-specific behaviors.
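
A minimal sketch of those settings; the batch size and worker count below are placeholders to tune, and persistent_workers is an extra option I'm adding that is available in recent PyTorch versions:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

def make_loader(dataset):
    # num_workers spawns that many subprocesses to load batches in parallel;
    # pin_memory puts batches in page-locked memory so .to("cuda", non_blocking=True)
    # transfers can overlap with compute.
    return DataLoader(
        dataset,
        batch_size=64,            # placeholder value
        shuffle=True,
        num_workers=4,            # tune to your CPU core count
        pin_memory=True,
        persistent_workers=True,  # keeps workers alive between epochs
    )

if __name__ == "__main__":
    # On Windows, worker processes are started with "spawn", so the loader must be
    # created under this guard or the script will try to re-launch itself.
    dataset = TensorDataset(torch.randn(1000, 10), torch.randint(0, 2, (1000,)))  # dummy data
    loader = make_loader(dataset)
    for x, y in loader:
        pass
```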


I'm already using num_workers. I haven't tried pin_memory; will try that.