Hi, I am creating my own data loader. However, I found that a slight change can make a huge difference in data-loading speed, and I really do not understand why. My data loader looks like this:
import cPickle
import glob
import torch
from torch.utils.data import DataLoader

def load_file(filename):
    # Load one image and its label, saved together in a pickled dict.
    with open(filename, 'rb') as fpik:
        data_dict = cPickle.load(fpik)
    return data_dict["data"], data_dict["label"]

class MyDataset(torch.utils.data.Dataset):
    # My dataset class: list all .pik files once, load them lazily.
    def __init__(self):
        print 'loading data list...'
        self.data_files = glob.glob('train_dnn' + '/*' + '/*.pik')

    def __getitem__(self, idx):
        return load_file(self.data_files[idx])

    def __len__(self):
        return len(self.data_files)

def get_loader():
    # Build the training data loader; shuffling is disabled here.
    dset_train = MyDataset()
    loader_train = DataLoader(dset_train, batch_size=256, shuffle=False,
                              num_workers=8, pin_memory=False)
    return loader_train
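For context, each .pik file is just a pickled dict with "data" and "label" keys, which is what load_file above unpacks. A minimal self-contained sketch of that on-disk format (using the standard pickle module here; the code above uses cPickle, its Python 2 C implementation), with a hypothetical sample value:

```python
import os
import pickle
import tempfile

# Hypothetical sample: the dict layout that load_file expects.
sample = {"data": [0.1, 0.2, 0.3], "label": 7}

# Write it the same way the training files were presumably written.
fd, path = tempfile.mkstemp(suffix='.pik')
with os.fdopen(fd, 'wb') as fpik:
    pickle.dump(sample, fpik)

# Read it back exactly as load_file does.
with open(path, 'rb') as fpik:
    data_dict = pickle.load(fpik)

features, label = data_dict["data"], data_dict["label"]
os.remove(path)
```

Each __getitem__ call therefore opens, unpickles, and closes one such file.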
The images are saved in subfolders of the directory 'train_dnn'. The subfolders are numbered 0, 1, ..., 200, with about 60,000 images in each, so it is a very large dataset.
If I just create a data loader by calling get_loader() as above, data batches load quite fast. But if I add shuffle(self.data_files) after self.data_files = glob.glob('train_dnn' + '/*' + '/*.pik'), then loading batches (not counting the time spent on the shuffle and glob themselves) becomes very slow. In both cases, I set shuffle=False in the data loader itself.
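To be concrete, the only change is shuffling the file list once at init time. A minimal self-contained sketch of that one-line change (using random.shuffle, which shuffles in place; the file paths here are hypothetical stand-ins for the glob result):

```python
import random

# Stand-in for the glob result: paths in sorted, on-disk order.
data_files = ['train_dnn/%d/img_%03d.pik' % (d, i)
              for d in range(3) for i in range(4)]
original = list(data_files)

random.shuffle(data_files)  # in-place; same files, random order

# Still the same set of files, so __getitem__ works unchanged --
# but consecutive indices now point into different subfolders.
assert sorted(data_files) == sorted(original)
```

So the dataset itself is unchanged; only the order in which indices map to files differs.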
I used 8 workers in both cases, and the model is a feed-forward DNN trained on the batches. A GeForce 980 card was used to train the model.
Does anyone have any ideas?