Hello. I have several numpy `.npy` dataset files. Each of them has thousands of samples, with a shape like `(num_samples, channels, width, height)`. If I use one of the files as the training set and load it with:
```python
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader

class MyDataset(Dataset):
    """npy dataset"""

    def __init__(self, input_file, label_file):
        # mmap_mode='r' keeps the arrays on disk; pages are read lazily
        self.inputs = np.load(input_file, mmap_mode='r')
        self.labels = np.load(label_file, mmap_mode='r')

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        # np.copy materializes only the requested sample in memory
        X = torch.from_numpy(np.copy(self.inputs[idx]))
        y = torch.from_numpy(np.copy(self.labels[idx]))
        return X, y

trainset = MyDataset(input_file, label_file)
trainloader = DataLoader(trainset, batch_size=256)
```
It trains very fast. But if I use one `Dataset` per file to load multiple `.npy` files and concatenate them into a `ConcatDataset`, the speed degrades far more than linearly: one epoch with 10 files takes about 100x as long as training with only 1 file.
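For reference, the slow multi-file setup looks roughly like this. This is a sketch of what I described above, with `MyDataset` repeated so it is self-contained; `build_concat_loader` and the file lists are just illustrative names:

```python
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader, ConcatDataset

class MyDataset(Dataset):
    """npy dataset backed by memory-mapped files"""

    def __init__(self, input_file, label_file):
        self.inputs = np.load(input_file, mmap_mode='r')
        self.labels = np.load(label_file, mmap_mode='r')

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        X = torch.from_numpy(np.copy(self.inputs[idx]))
        y = torch.from_numpy(np.copy(self.labels[idx]))
        return X, y

def build_concat_loader(input_files, label_files, batch_size=256):
    # one memory-mapped Dataset per file, glued together by ConcatDataset
    datasets = [MyDataset(x, y) for x, y in zip(input_files, label_files)]
    return DataLoader(ConcatDataset(datasets), batch_size=batch_size)
```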
How do I speed up the training process? I guess I could load the dataset into memory one file at a time. Is that possible?
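The one-file-at-a-time idea could be sketched like this (names like `file_loaders` are hypothetical, and I'm assuming each single file fits in RAM): load each `.npy` pair fully into memory without `mmap_mode`, wrap it in a `TensorDataset`, and train on each file's batches before moving to the next file.

```python
import numpy as np
import torch
from torch.utils.data import TensorDataset, DataLoader

def file_loaders(input_files, label_files, batch_size=256):
    """Yield one DataLoader per file, loading each file fully into RAM."""
    for input_file, label_file in zip(input_files, label_files):
        # no mmap here: the whole array is read into memory once
        X = torch.from_numpy(np.load(input_file))
        y = torch.from_numpy(np.load(label_file))
        yield DataLoader(TensorDataset(X, y), batch_size=batch_size, shuffle=True)

# One epoch = iterate over files, then over batches within each file:
# for loader in file_loaders(input_files, label_files):
#     for X, y in loader:
#         ...  # training step
```

This trades shuffling across the whole dataset for shuffling within each file, which may or may not be acceptable for my training setup.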