Proper way to load large NumPy files as training samples

Hi all,
I’m new to PyTorch and I’m using a CNN for classification. Each input sample is 192×288 with 12 channels, so each sample is around 2 MB. I noticed there are some discussions about loading data lazily, so I tried the following dataset.

import os

import numpy as np
import torch
from torch.utils.data import Dataset

class FileDataset(Dataset):
    def __init__(self):
        super().__init__()
        self.path = 'test_dataset/whole/'
        # Collect the positive and negative sample files into one list.
        pos_dir = os.path.join(self.path, 'positive')
        neg_dir = os.path.join(self.path, 'negative')
        self.p_files = [os.path.join(pos_dir, f) for f in os.listdir(pos_dir)]
        self.n_files = [os.path.join(neg_dir, f) for f in os.listdir(neg_dir)]
        self.files = self.n_files + self.p_files

    def __len__(self):
        return len(self.files)

    def __getitem__(self, item):
        # One disk read per sample: load the array and wrap it as a tensor.
        x = np.load(self.files[item])
        return torch.from_numpy(x)

Data loading with the DataLoader is very slow, and both GPU and CPU usage are low.

I have also tried NumPy memmap and HDF5, but the speed is still not acceptable. The code looks something like the following.

import numpy as np
import torch
from torch.utils.data import Dataset

class MmapDataset(Dataset):
    def __init__(self, ens, train=True):
        super().__init__()
        if train:
            self.x = np.memmap('large_test' + ens, mode='r',
                               shape=(9760, 12, 192, 288), dtype='float32')
            self.y = np.load('data/classification/Q850_train_y' + ens + '.npy')
        else:
            self.x = np.memmap('large_test_val' + ens, mode='r',
                               shape=(610, 12, 192, 288), dtype='float32')
            self.y = np.load('data/classification/Q850_test_y' + ens + '.npy')

    def __getitem__(self, item):
        # Indexing the memmap reads only this sample from disk; copy it so
        # the tensor owns writable memory instead of the read-only mapping.
        x = torch.from_numpy(np.array(self.x[item]))
        return x, self.y[item]

    def __len__(self):
        return self.x.shape[0]
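
The HDF5 version was similar in spirit. As a minimal sketch of that kind of dataset (using h5py; the 'x'/'y' dataset names are placeholders, and the file is opened lazily so it works with multiple workers):

import h5py
import torch
from torch.utils.data import Dataset

class H5Dataset(Dataset):
    def __init__(self, path):
        super().__init__()
        self.path = path
        self.h5 = None
        with h5py.File(path, 'r') as f:
            self.length = f['x'].shape[0]

    def __getitem__(self, item):
        # Open the file lazily in each worker process rather than in
        # __init__, since h5py handles don't survive forking into workers.
        if self.h5 is None:
            self.h5 = h5py.File(self.path, 'r')
        x = torch.from_numpy(self.h5['x'][item])
        return x, self.h5['y'][item]

    def __len__(self):
        return self.length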

I tried both creating one huge memmap/HDF5 file and splitting the data into several smaller files combined with ConcatDataset. The results are similar.
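
The smaller-files variant was stitched together roughly like this (a sketch; the shard suffixes are hypothetical):

from torch.utils.data import ConcatDataset

# Hypothetical shard suffixes; each MmapDataset wraps one smaller file.
shards = [MmapDataset(ens) for ens in ('0', '1', '2')]
train_set = ConcatDataset(shards)  # indexes across all shards as one dataset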

Does anyone have any idea about the potential improvement?
Thanks in advance!

Hi,

You might want to use torch.save/torch.load directly to reduce intermediary steps in the loading.
Also, adding more workers to the DataLoader will help load things faster.
Finally, make sure to use an SSD if possible, as it makes a huge difference compared to a spinning disk.
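
For the first suggestion, a minimal sketch of a one-time conversion (the paths here are just examples):

import os

import numpy as np
import torch

# One-time conversion: rewrite each .npy sample as a .pt file so that
# loading later goes straight to a torch tensor.
src_dir = 'test_dataset/whole/positive'     # example path
dst_dir = 'test_dataset/whole_pt/positive'  # example path
os.makedirs(dst_dir, exist_ok=True)
for name in os.listdir(src_dir):
    x = torch.from_numpy(np.load(os.path.join(src_dir, name)))
    torch.save(x, os.path.join(dst_dir, name.replace('.npy', '.pt')))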


Hi, thanks for your reply.

Do you mean I should save the data with torch.save instead of as NumPy files? I tried different numbers of workers, but it is still slow.

Yes, you can save it in torch format so that you don’t need the extra hop between NumPy and torch when loading.
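
In the dataset, that could look like this sketch (assuming the samples were re-saved as .pt files beforehand):

import torch
from torch.utils.data import Dataset

class PtFileDataset(Dataset):
    def __init__(self, files):
        super().__init__()
        self.files = files  # list of .pt file paths

    def __len__(self):
        return len(self.files)

    def __getitem__(self, item):
        # torch.load returns the tensor directly; no numpy -> torch hop.
        return torch.load(self.files[item])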

Reading many small files from disk is slow (especially on spinning disks); there is no way around that, I’m afraid. You can increase the number of workers until it starts slowing down. But beyond that, there isn’t much you can do if your dataset doesn’t fit in memory.
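
For example (the dataset variable, batch size, and worker count here are placeholders to tune):

from torch.utils.data import DataLoader

loader = DataLoader(dataset, batch_size=32, shuffle=True,
                    num_workers=4,    # raise until throughput stops improving
                    pin_memory=True)  # speeds up host-to-GPU copies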