Fastest way of dataloading large datasets


I am working on a classification problem that involves feature generation first and then classification. Due to the constraints of my problem, I generate the features first and then train the classifier separately. However, data loading has become the bottleneck of my pipeline.

I save each generated data sample as a list of tensors in a .pt file, with dimensions [(50,10,10,10), (1)], the last (1) being the associated label tensor.
I use standard dataset and dataloader code:

import os

import torch
from torch.utils.data import Dataset, DataLoader

class LR_Dataset(Dataset):
    def __init__(self, filepath):
        self.filepath = filepath
        # One .pt file per sample
        self.filenames = os.listdir(self.filepath)

    def __len__(self):
        return len(self.filenames)

    def __getitem__(self, idx):
        # Each file holds [features (50,10,10,10), label (1)]
        x, y = torch.load(os.path.join(self.filepath, self.filenames[idx]))
        return x, y

def dataloader(filepath, batch_size, num_workers=0):
    dataset = LR_Dataset(filepath)
    return DataLoader(dataset, batch_size=batch_size, num_workers=num_workers)

Note that the dataset folder contains ~150,000 .pt files. So I am not sure whether reading the .pt files is what takes the time, or whether it is because the .pt files contain tensors rather than numpy arrays.
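One way to narrow this down is to time the per-file load in isolation, outside the DataLoader. The sketch below is only an illustration: time_loads and its arguments are hypothetical names, not part of the code above, and it measures whatever loader function is passed in.

```python
import time

def time_loads(load_fn, paths, limit=200):
    """Return the mean seconds per call of load_fn over the first `limit` paths."""
    paths = list(paths)[:limit]
    if not paths:
        raise ValueError("no paths to time")
    start = time.perf_counter()
    for p in paths:
        load_fn(p)
    return (time.perf_counter() - start) / len(paths)

# Hypothetical usage with the dataset folder from the question:
#   files = [os.path.join(filepath, f) for f in os.listdir(filepath)]
#   print(time_loads(torch.load, files))
```

If the mean time per file multiplied by the batch size is close to the observed time per batch, the file reads themselves are the likely bottleneck. Note that the first pass over the files may be slower than later passes because of operating system caching.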

I am facing a tough deadline. Any help would be greatly appreciated.

Thank you


Please help!

I don’t think there is any overhead from using tensors instead of np arrays.

You could try setting num_workers > 0 to use multiprocessing in the DataLoader.
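A minimal sketch of what that could look like is below. make_loader is a hypothetical helper name, the default of 4 workers is an arbitrary assumption to tune per machine, and pin_memory and persistent_workers are optional extras that may or may not help here.

```python
import torch
from torch.utils.data import DataLoader

def make_loader(dataset, batch_size, num_workers=4):
    # persistent_workers is only valid when num_workers > 0, so tie it to that.
    use_workers = num_workers > 0
    return DataLoader(
        dataset,
        batch_size=batch_size,
        num_workers=num_workers,               # worker processes loading samples in parallel
        pin_memory=torch.cuda.is_available(),  # page-locked host memory for faster GPU copies
        persistent_workers=use_workers,        # keep workers alive between epochs
    )

# Hypothetical usage with the dataset class from the question:
#   loader = make_loader(LR_Dataset(filepath), batch_size=32, num_workers=4)
```

On platforms that start workers with spawn (Windows, macOS), the code that iterates the loader should sit under an `if __name__ == "__main__":` guard.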

As @Dipayan_Das explained, multiple workers might give you a speedup.
If that’s not helping, you could try to preload the complete dataset, which should take approx. 27GB, if my calculation is right.
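For reference, the estimate can be reproduced as follows, assuming the features are stored as float32 (4 bytes per element; the dtype is not stated in the thread) and ignoring the small label tensors:

```python
n_samples = 150_000
elems_per_sample = 50 * 10 * 10 * 10   # 50,000 values per feature tensor
bytes_per_elem = 4                     # assumption: float32

total_bytes = n_samples * elems_per_sample * bytes_per_elem
total_gib = total_bytes / 2**30

print(f"{total_bytes / 1e9:.1f} GB, {total_gib:.2f} GiB")
# prints: 30.0 GB, 27.94 GiB
```

So the figure is about 30 GB in decimal units, or roughly 27.9 GiB in binary units.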

Thanks a lot for considering this, but I have limited RAM of 16GB and an 8GB GPU. Any method other than loading the whole dataset into RAM would do. Additionally, I have indeed experimented with num_workers; however, the effect is not at all significant.

Thank you
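One option that avoids holding everything in RAM is to pack the samples once into a single memory-mapped .npy file and read individual samples from it, so the operating system only pages in what is touched. This is a sketch and not something suggested earlier in the thread: pack_samples and MemmapDataset are hypothetical names, and it assumes float32 features of shape (50,10,10,10) and integer class labels.

```python
import numpy as np
import torch
from torch.utils.data import Dataset

def pack_samples(samples, x_path, y_path, n, x_shape=(50, 10, 10, 10)):
    """Write n (x, y) pairs from an iterable into two .npy files, one sample at a time."""
    xs = np.lib.format.open_memmap(x_path, mode="w+", dtype=np.float32, shape=(n, *x_shape))
    ys = np.lib.format.open_memmap(y_path, mode="w+", dtype=np.int64, shape=(n,))
    for i, (x, y) in enumerate(samples):
        xs[i] = np.asarray(x, dtype=np.float32)
        ys[i] = int(y)  # assumption: one integer label per sample
    xs.flush()
    ys.flush()

class MemmapDataset(Dataset):
    def __init__(self, x_path, y_path):
        self.x_path, self.y_path = x_path, y_path
        self.xs = None  # opened lazily so each DataLoader worker maps the file itself
        self.ys = None
        self._len = np.load(y_path, mmap_mode="r").shape[0]

    def __len__(self):
        return self._len

    def __getitem__(self, idx):
        if self.xs is None:
            self.xs = np.load(self.x_path, mmap_mode="r")
            self.ys = np.load(self.y_path, mmap_mode="r")
        x = torch.from_numpy(np.array(self.xs[idx]))  # copy a single sample out of the map
        y = torch.tensor([int(self.ys[idx])], dtype=torch.long)
        return x, y

# Hypothetical one-time packing from the existing .pt files:
#   files = sorted(os.listdir(filepath))
#   gen = (torch.load(os.path.join(filepath, f)) for f in files)
#   pack_samples(gen, "x.npy", "y.npy", n=len(files))
```

The packed feature file would still need about 30 GB of disk space under the float32 assumption, and whether it is faster than 150,000 separate files depends on the storage device, so it is worth measuring on a subset first.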