Speed of loading customized dataset using pytorch dataloader

weedwind · June 30, 2017, 3:56am

Hi, I am creating my own data loader. However, I found that a slight change can make a huge difference in data loading speed, and I do not understand why. My data loader looks like this:

import cPickle
import glob
import torch
from torch.utils.data import DataLoader

def load_file(filename):       # function to load one image saved in a dict
   with open(filename, 'rb') as fpik:
      data_dict = cPickle.load(fpik)
   return data_dict["data"], data_dict["label"]


class MyDataset(torch.utils.data.Dataset):      # My data set class
   def __init__(self):
       print 'loading data list...'
       self.data_files = glob.glob('train_dnn' + '/*' + '/*.pik')
   def __getitem__(self, idx):
       return load_file(self.data_files[idx])
   def __len__(self):
       return len(self.data_files)


def get_loader():    # data loader
   dset_train = MyDataset()
   loader_train = DataLoader(dset_train, batch_size=256, shuffle=False, num_workers=8, pin_memory=False)
   return loader_train

The images are saved in subfolders of the directory ‘train_dnn’. The subfolders are numbered 0, 1, …, 200, with about 60,000 images in each, so it is a very large dataset.

If I just create a data loader by calling get_loader() as above, loading batches is quite fast. But if I add shuffle(self.data_files) after self.data_files = glob.glob('train_dnn' + '/*' + '/*.pik'), loading batches (not counting the time for glob and the shuffle) becomes very slow. In both cases, I set shuffle = False in the data loader itself.

I used 8 workers in both cases. The model is a feed-forward DNN trained on the batches, on a GeForce 980 card.

Does anyone have any ideas?

bartolsthoorn · June 30, 2017, 4:56pm

Why do you set shuffle=False in the DataLoader itself? With shuffle=True, the idx passed to __getitem__ is already randomized, so you don’t have to preshuffle the file list in MyDataset.

Alternatively, you can pass your own Sampler instance to the DataLoader to control which samples you get for each batch (another way to shuffle).
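
As a rough illustration of the Sampler idea, here is a minimal sketch of an index sampler that shuffles the order of contiguous chunks of indices but keeps indices inside each chunk in their original order. The class name ChunkShuffleSampler and its parameters (num_samples, chunk_size, seed) are hypothetical and not part of PyTorch. It is written as a plain Python iterable; depending on the PyTorch version, it may need to subclass torch.utils.data.Sampler before it can be passed as the sampler argument.

```python
import random


class ChunkShuffleSampler(object):
    """Hypothetical sampler: yields every index in [0, num_samples) once,
    visiting contiguous chunks in random order (illustration only)."""

    def __init__(self, num_samples, chunk_size, seed=None):
        self.num_samples = num_samples
        self.chunk_size = chunk_size
        self.rng = random.Random(seed)

    def __iter__(self):
        # Start offsets of each contiguous chunk of indices.
        starts = list(range(0, self.num_samples, self.chunk_size))
        # Randomize the order in which chunks are visited.
        self.rng.shuffle(starts)
        for start in starts:
            stop = min(start + self.chunk_size, self.num_samples)
            # Indices inside a chunk stay in their original order.
            for idx in range(start, stop):
                yield idx

    def __len__(self):
        return self.num_samples
```

With the dataset from the first post, it might be used as DataLoader(dset_train, batch_size=256, sampler=ChunkShuffleSampler(len(dset_train), 4096)), leaving shuffle at its default of False. Whether this helps depends on neighbouring entries in data_files also being close together on disk, which is an assumption, and each batch is then less random than with a full shuffle.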

weedwind · June 30, 2017, 5:29pm

Thank you for your response. I did try setting shuffle to True, so that I do not need to shuffle the list myself. But setting shuffle to True caused a huge slowdown in data loading compared with setting it to False. The strange thing is that even if I set shuffle to False and shuffle the data myself, loading is still slow. The fastest way is to set shuffle to False and not shuffle the data myself either… I am really confused.

colesbury · June 30, 2017, 7:28pm

If shuffling slows down your data loading, it’s probably the random access to your hard disk that is slow. Move your data to an SSD.
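
One way to check this is to time raw file reads in sorted order against reads in a random order, without PyTorch involved. The sketch below is only an illustration: the helper names time_reads and compare_orders and the sample_size parameter are made up here, and it assumes that sorted paths roughly match the layout on disk. The two runs use disjoint sets of files so the second run does not read files the first run just cached, but the OS page cache can still distort the numbers if the files were read recently.

```python
import random
import time


def time_reads(paths):
    """Read every file in the given order; return (seconds, total bytes)."""
    start = time.time()
    total_bytes = 0
    for path in paths:
        with open(path, 'rb') as f:
            total_bytes += len(f.read())
    return time.time() - start, total_bytes


def compare_orders(paths, sample_size=2000, seed=0):
    """Time an in-order read of one slice of files against a
    random-order read of files sampled from the remaining ones."""
    paths = sorted(paths)
    sample_size = min(sample_size, len(paths) // 2)
    sequential = paths[:sample_size]
    # Draw the scattered files from the rest so the two sets are disjoint.
    scattered = random.Random(seed).sample(paths[sample_size:], sample_size)
    seq_time, _ = time_reads(sequential)
    rand_time, _ = time_reads(scattered)
    return seq_time, rand_time


# Example call, using the glob pattern from the first post:
# import glob
# print(compare_orders(glob.glob('train_dnn' + '/*' + '/*.pik')))
```

If the second number is much larger than the first on a hard disk, seek time is the likely bottleneck rather than unpickling or the DataLoader workers.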

weedwind · June 30, 2017, 8:43pm

Yeah, that’s probably the reason. The CPU usage of each worker is only 0.3%.