Data loading takes a really long time when the length of self.data in CustomDataset is huge

Hello All,

I’m experiencing strange behavior while my DataLoader loads data (i.e., while calling __getitem__), depending on the length of the dataset. My actual dataset has around 200K samples, and self.data holds a list/np.array of dictionaries, where each item has a fixed set of key-value pairs (data_file_name, data_file_path, etc.). In __getitem__ I simply do:

sample = self.data[index]                      # dict holding file metadata
data = np.load(sample.get('data_file_path'))   # one disk read per sample
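
For reference, here is a minimal sketch of the class (the actual dataset has more fields; __init__ and __len__ here are just boilerplate I’ve filled in):

import numpy as np
from torch.utils.data import Dataset

class CustomDataset(Dataset):
    def __init__(self, data):
        # data: list of dicts with 'data_file_name', 'data_file_path', etc.
        self.data = data

    def __len__(self):
        return len(self.data)

    def __getitem__(self, index):
        sample = self.data[index]
        # one np.load (disk read) per sample
        return np.load(sample.get('data_file_path'))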

When I set my DataLoader’s batch_size = 30 and num_workers = 4 and iterate over the full dataset, data loading is very slow (left image).
However, when I keep everything else the same (same CustomDataset, same DataLoader, batch_size = 30, num_workers = 4, etc.) but use only a subset of around 5K samples, data loading is very fast (right image).
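
For completeness, the loader setup is roughly the following (shuffle=True is just a placeholder; batch_size and num_workers are as stated above):

from torch.utils.data import DataLoader

dataset = CustomDataset(data)   # data: the full 200K list or the 5K subset
loader = DataLoader(dataset, batch_size=30, num_workers=4, shuffle=True)

for batch in loader:
    pass  # training step goes here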

What could explain this, and is there a workaround?

What is the on-disk size of your dataset, i.e., of the 5K subset and the full 200K set? Maybe OS file caching is helping you with the smaller dataset: the 5K subset may fit entirely in the page cache after the first pass, while the full set forces actual disk reads on every access.
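
For example, something like this would show whether the 5K subset fits in RAM while the 200K set does not (subset_data and full_data stand for the lists backing self.data; the names are hypothetical):

import os

def total_size_gb(samples):
    # sum the on-disk size of every file referenced by the sample dicts
    return sum(os.path.getsize(s['data_file_path']) for s in samples) / 1e9

print(f"5K subset : {total_size_gb(subset_data):.2f} GB")
print(f"200K full : {total_size_gb(full_data):.2f} GB")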