Hello. I have several numpy `.npy` dataset files. Each of them has thousands of samples, with a shape like `(num_samples, channels, width, height)`. If I use one of the files as the training set and load it with:
```python
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader

class MyDataset(Dataset):
    """npy dataset"""

    def __init__(self, input_file, label_file):
        # mmap_mode='r' keeps the arrays on disk; pages are read lazily
        self.inputs = np.load(input_file, mmap_mode='r')
        self.labels = np.load(label_file, mmap_mode='r')

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        # np.copy materializes only the requested sample in memory
        X = torch.from_numpy(np.copy(self.inputs[idx]))
        y = torch.from_numpy(np.copy(self.labels[idx]))
        return X, y

trainset = MyDataset(input_file, label_file)
trainloader = DataLoader(trainset, batch_size=256)
```
It trains very fast. But if I use one `Dataset` per file to load multiple `.npy` files and concatenate them into a `ConcatDataset`, the speed degrades far more than linearly: one epoch with 10 files takes about 100x as long as training with only 1 file.
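For reference, the slow multi-file setup looks roughly like this. This is a sketch of what I described above, with `MyDataset` repeated so it is self-contained; `build_concat_loader` and the file lists are just illustrative names:

```python
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader, ConcatDataset

class MyDataset(Dataset):
    """npy dataset backed by memory-mapped files"""

    def __init__(self, input_file, label_file):
        self.inputs = np.load(input_file, mmap_mode='r')
        self.labels = np.load(label_file, mmap_mode='r')

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        X = torch.from_numpy(np.copy(self.inputs[idx]))
        y = torch.from_numpy(np.copy(self.labels[idx]))
        return X, y

def build_concat_loader(input_files, label_files, batch_size=256):
    # one memory-mapped Dataset per file, glued together by ConcatDataset
    datasets = [MyDataset(x, y) for x, y in zip(input_files, label_files)]
    return DataLoader(ConcatDataset(datasets), batch_size=batch_size)
```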
How do I speed up the training process? I guess I could load the dataset into memory one file at a time. Is that possible?
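The one-file-at-a-time idea could be sketched like this (names like `file_loaders` are hypothetical, and I'm assuming each single file fits in RAM): load each `.npy` pair fully into memory without `mmap_mode`, wrap it in a `TensorDataset`, and train on each file's batches before moving to the next file.

```python
import numpy as np
import torch
from torch.utils.data import TensorDataset, DataLoader

def file_loaders(input_files, label_files, batch_size=256):
    """Yield one DataLoader per file, loading each file fully into RAM."""
    for input_file, label_file in zip(input_files, label_files):
        # no mmap here: the whole array is read into memory once
        X = torch.from_numpy(np.load(input_file))
        y = torch.from_numpy(np.load(label_file))
        yield DataLoader(TensorDataset(X, y), batch_size=batch_size, shuffle=True)

# One epoch = iterate over files, then over batches within each file:
# for loader in file_loaders(input_files, label_files):
#     for X, y in loader:
#         ...  # training step
```

This trades shuffling across the whole dataset for shuffling within each file, which may or may not be acceptable for my training setup.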