I am working on a classification problem that involves feature generation followed by classification. Because of the constraints of my problem, I generate the features first and then train the classifier separately. However, data loading has become the bottleneck of my pipeline and is slowing my progress.
I save each generated data sample as a list of tensors in its own .pt file, with shapes [(50,10,10,10), (1)]; the last (1) is the associated label tensor.
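A minimal sketch of this file layout, assuming float32 features and an integer label. The output folder, the file name, and the random data are placeholders for illustration, not the actual generation code:

```python
import os
import tempfile

import torch

# Assumed layout: one file per sample, holding the list [features, label].
features = torch.randn(50, 10, 10, 10)  # placeholder feature tensor (float32)
label = torch.tensor([1])               # placeholder label tensor, shape (1,)

out_dir = tempfile.mkdtemp()                       # hypothetical output folder
path = os.path.join(out_dir, "sample_000000.pt")   # hypothetical file name
torch.save([features, label], path)

# Reading the file back yields the same two tensors.
x, y = torch.load(path)
print(x.shape, y.shape)
```

Under the float32 assumption, each sample holds 50 × 10 × 10 × 10 = 5,000 values, so roughly 20 KB of tensor data per file.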
I use standard Dataset and DataLoader code:
    import os

    import torch
    from torch.utils.data import Dataset, DataLoader


    class LR_Dataset(Dataset):
        def __init__(self, filepath):
            self.filepath = filepath
            # One .pt file per sample
            self.filenames = os.listdir(self.filepath)

        def __len__(self):
            return len(self.filenames)

        def __getitem__(self, idx):
            # Each file holds [features, label]
            x, y = torch.load(os.path.join(self.filepath, self.filenames[idx]))
            return x, y


    def dataloader(filepath, batch_size, num_workers=0):
        dataset = LR_Dataset(filepath)
        return DataLoader(dataset, batch_size=batch_size, num_workers=num_workers)
Note that the dataset folder contains ~150,000 .pt files. I am not sure whether reading the .pt files themselves is what takes the time, or whether it is slow because the files contain tensors rather than NumPy arrays.
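One rough way to narrow this down would be to time the per-file load call on a random subset of files, separately from the DataLoader. The sketch below is only an illustration: `time_loads` and its parameters are hypothetical names, and it takes the load function as an argument so the same helper can time `torch.load` or any alternative reader.

```python
import random
import time


def time_loads(load_fn, paths, n_samples=200, seed=0):
    """Time load_fn on a random subset of paths; return stats in seconds."""
    rng = random.Random(seed)
    paths = list(paths)
    subset = rng.sample(paths, min(n_samples, len(paths)))

    durations = []
    for path in subset:
        start = time.perf_counter()
        load_fn(path)  # result is discarded; only the elapsed time matters
        durations.append(time.perf_counter() - start)

    total = sum(durations)
    return {
        "n": len(durations),
        "total_s": total,
        "mean_s": total / len(durations) if durations else 0.0,
        "max_s": max(durations, default=0.0),
    }
```

For the dataset above, `load_fn` would be `torch.load` and `paths` the full paths of the files under `filepath`. Multiplying `mean_s` by the number of files gives an estimate of the pure file-reading cost per epoch, which can be compared with the measured epoch time. Repeated runs over the same files may look faster because of the operating system's file cache.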
I am facing a tough deadline. Any help would be greatly appreciated.