Hi all,
I’m new to PyTorch and I’m using a CNN for classification. Each input sample is 192×288 with 12 channels, so a single sample is roughly 2.5 MB as float32. I noticed there are some discussions about loading data lazily, so I tried the following dataset.
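For context, here is the back-of-envelope calculation for the per-sample size (assuming float32, which is the dtype I store):

```python
# Per-sample memory, assuming float32 (4 bytes per value)
channels, height, width = 12, 192, 288
sample_bytes = channels * height * width * 4
print(sample_bytes)            # 2654208 bytes
print(sample_bytes / 2**20)    # ~2.53 MiB
```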
import os
import numpy as np
import torch
from torch.utils.data import Dataset

class FileDataset(Dataset):
    def __init__(self):
        super(FileDataset, self).__init__()
        self.Path = 'test_dataset/whole/'
        self.pos_files = os.listdir(self.Path + 'positive')
        self.p_files = [os.path.join(self.Path + 'positive', i) for i in self.pos_files]
        self.neg_files = os.listdir(self.Path + 'negative')
        self.n_files = [os.path.join(self.Path + 'negative', i) for i in self.neg_files]
        self.files = self.n_files + self.p_files

    def __len__(self):
        return len(self.files)

    def __getitem__(self, item):
        path = self.files[item]
        x = np.load(path)  # one ~2.5 MB array per file
        x_t = torch.from_numpy(x)
        return x_t
The data loading speed in DataLoader is very slow, and both GPU and CPU usage stay low.
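For reference, a minimal, runnable sketch of the loading loop; the dataset here is a random-tensor stand-in for FileDataset, and the num_workers / pin_memory values are illustrative guesses rather than my exact settings:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

if __name__ == "__main__":
    # Stand-in for FileDataset: random tensors with the same per-sample shape.
    data = torch.randn(8, 12, 192, 288)
    labels = torch.zeros(8, dtype=torch.long)
    dataset = TensorDataset(data, labels)

    # num_workers > 0 moves the np.load / disk reads into worker processes;
    # pin_memory speeds up host-to-GPU copies. Exact values are illustrative.
    loader = DataLoader(dataset, batch_size=4, shuffle=True,
                        num_workers=2, pin_memory=True)

    for xb, yb in loader:
        assert xb.shape == (4, 12, 192, 288)
```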
I have also tried a NumPy memmap and HDF5; the speed is still not acceptable. The code is something like the following.
import numpy as np
import torch
from torch.utils.data import Dataset

class MmapDataset(Dataset):
    def __init__(self, ens, train=True):
        super(MmapDataset, self).__init__()
        if train:
            self.x = np.memmap('large_test' + ens, mode='r',
                               shape=(9760, 12, 192, 288), dtype='float32')
            self.y = np.load('data/classification/Q850_train_y' + ens + '.npy')
        else:
            self.x = np.memmap('large_test_val' + ens, mode='r',
                               shape=(610, 12, 192, 288), dtype='float32')
            self.y = np.load('data/classification/Q850_test_y' + ens + '.npy')

    def __getitem__(self, item):
        x = self.x[item]  # reads one sample from the memmap
        x = torch.from_numpy(x)
        y = self.y[item]
        return x, y

    def __len__(self):
        return self.x.shape[0]
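For completeness, here is a self-contained miniature of the memmap round trip I am doing (the file name, path, and sample count are shrunk here; my real file is the (9760, 12, 192, 288) one above):

```python
import os
import tempfile
import numpy as np

# Miniature of the real (9760, 12, 192, 288) file: same layout, fewer samples.
shape = (4, 12, 192, 288)
path = os.path.join(tempfile.gettempdir(), 'small_test.dat')

# Write phase: create the file once, fill it, flush to disk.
mm = np.memmap(path, mode='w+', shape=shape, dtype='float32')
mm[:] = np.random.rand(*shape).astype('float32')
mm.flush()

# Read phase: re-open read-only; __getitem__ then touches one sample at a time,
# so only the pages backing that sample have to come off disk.
ro = np.memmap(path, mode='r', shape=shape, dtype='float32')
sample = np.array(ro[1])   # explicit copy of one (12, 192, 288) sample
print(sample.shape)        # (12, 192, 288)
```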
I have tried both a single huge memmap/HDF5 file and several smaller files combined with ConcatDataset. The results are similar.
Does anyone have ideas for potential improvements?
Thanks in advance!