Hello,
I am working on a classification problem that involves feature generation first and then classification. Due to the constraints of my problem, I generate the features first and then train the classifier separately. However, data loading has become the bottleneck of my pipeline and is slowing my progress.
I save each generated data sample as a .pt file containing a list of two tensors with shapes [(50,10,10,10), (1)]; the second tensor, of shape (1), is the associated label.
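To make the file format concrete, here is a minimal sketch of how one such sample could be written and read back. The helper name `save_sample`, the file name `sample_0.pt`, the random feature values, and the integer label dtype are assumptions for illustration only, not my actual generation code.

```python
import os
import tempfile

import torch


def save_sample(path, features, label):
    # Hypothetical helper: store the feature tensor and its label
    # together as a two-element list in a single .pt file.
    torch.save([features, label], path)


# Temporary folder and file name are placeholders for illustration.
tmp_dir = tempfile.mkdtemp()
sample_path = os.path.join(tmp_dir, "sample_0.pt")

# Random features with the shape described above; label dtype is assumed.
save_sample(sample_path, torch.randn(50, 10, 10, 10), torch.tensor([1]))

# Reading the file back yields the same two tensors.
x, y = torch.load(sample_path)
print(x.shape, y.shape)
```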
I use standard Dataset and DataLoader code:
import os

import torch
from torch.utils.data import Dataset, DataLoader

class LR_Dataset(Dataset):
    def __init__(self, filepath):
        self.filepath = filepath
        # One .pt file per sample
        self.filenames = os.listdir(self.filepath)

    def __len__(self):
        return len(self.filenames)

    def __getitem__(self, idx):
        # Each file holds a [features, label] pair
        x, y = torch.load(os.path.join(self.filepath, self.filenames[idx]))
        return x, y

def dataloader(filepath, batch_size, num_workers=0):
    dataset = LR_Dataset(filepath)
    return DataLoader(dataset, batch_size=batch_size, num_workers=num_workers)
Note that the dataset folder contains ~150,000 .pt files. I am not sure whether the time is spent reading the .pt files themselves, or whether the problem is that the files contain tensors rather than NumPy arrays.
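One way to test this hypothesis would be to time both formats on identical data. The following is only a sketch under assumptions: the helper names `time_loads` and `load_npz`, the `.npz` layout with keys `x` and `y`, and the 20 random samples are all made up for illustration, and the numbers on a real disk with ~150,000 files may look quite different.

```python
import os
import tempfile
import time

import numpy as np
import torch


def time_loads(load_fn, paths):
    # Average wall-clock time per file for the given loading function.
    start = time.perf_counter()
    for path in paths:
        load_fn(path)
    return (time.perf_counter() - start) / len(paths)


def load_npz(path):
    # Read both arrays so the file contents are actually loaded.
    with np.load(path) as data:
        return data["x"], data["y"]


# Write the same random samples once as .pt and once as .npz.
tmp_dir = tempfile.mkdtemp()
pt_paths, npz_paths = [], []
for i in range(20):
    x = torch.randn(50, 10, 10, 10)
    y = torch.tensor([i % 2])
    pt_path = os.path.join(tmp_dir, f"sample_{i}.pt")
    npz_path = os.path.join(tmp_dir, f"sample_{i}.npz")
    torch.save([x, y], pt_path)
    np.savez(npz_path, x=x.numpy(), y=y.numpy())
    pt_paths.append(pt_path)
    npz_paths.append(npz_path)

pt_time = time_loads(torch.load, pt_paths)
npz_time = time_loads(load_npz, npz_paths)
print(f"torch.load: {pt_time * 1e3:.3f} ms per file")
print(f"np.load:    {npz_time * 1e3:.3f} ms per file")
```

Because the files were just written, they are likely served from the OS page cache, so this mostly compares deserialization cost rather than disk reads.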
I am facing a tough deadline. Any help would be greatly appreciated.
Thank you