DataLoader with cache and shuffle implementation

I would like to implement something equivalent to the combined cache and shuffle features of tf.data.Dataset.

More precisely, to speed up training, I would like to write code equivalent to:

inp, gt = train_generator[0]
train_dataset = tf.data.Dataset.from_generator(train_generator,
                                               output_shapes=(inp.shape, gt.shape),
                                               output_types=(inp.dtype, gt.dtype)).batch(batch_size)
train_dataset = train_dataset.prefetch(4)
train_dataset = train_dataset.cache()
train_dataset = train_dataset.shuffle(len(train_generator) + 1, reshuffle_each_iteration=True)

I have already found some partial answers, such as "Best practice to cache the entire dataset during first epoch", but I think none of them supports shuffling.

My starting point is my dataset class:

import numpy as np
import torch
from scipy import io
from torch.utils.data import Dataset


class MatDataset(Dataset):

    def __init__(self, img_path_list):
        self.img_path_list = img_path_list

    def __len__(self):
        return len(self.img_path_list)

    def __getitem__(self, index):
        # Load one .mat file containing the multispectral and panchromatic images
        temp = io.loadmat(self.img_path_list[index])
        I_PAN = temp['I_PAN']
        I_MS = temp['I_MS']
        # I_MS = interp23tap(I_MS, 4)
        I_MS = np.moveaxis(I_MS, -1, 0)    # HWC -> CHW
        I_PAN = np.expand_dims(I_PAN, 0)   # add channel dimension

        return torch.from_numpy(I_MS.astype(np.float32)), torch.from_numpy(I_PAN.astype(np.float32))

How can I do this?

Thanks a lot!

You can check out torchdata, which introduces DataPipe as a replacement for torch's Dataset. The built-in DataPipes cover some of the functionality you need (in-memory cache, on-disk cache, shuffle).
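For example, a minimal sketch of such a pipeline with DataPipes might look like the following (the loading function mirrors MatDataset.__getitem__ from the question; the cache size in megabytes, the shuffle buffer size, and names like img_path_list and batch_size are illustrative assumptions):

import numpy as np
import torch
from scipy import io
from torch.utils.data import DataLoader
from torchdata.datapipes.iter import IterableWrapper

def load_mat(path):
    # Same preprocessing as MatDataset.__getitem__ in the question.
    temp = io.loadmat(path)
    I_MS = np.moveaxis(temp['I_MS'], -1, 0)
    I_PAN = np.expand_dims(temp['I_PAN'], 0)
    return (torch.from_numpy(I_MS.astype(np.float32)),
            torch.from_numpy(I_PAN.astype(np.float32)))

datapipe = (
    IterableWrapper(img_path_list)              # list of .mat file paths
    .map(load_mat)                              # decode each file (only once, thanks to the cache)
    .in_memory_cache(size=4096)                 # keep up to ~4 GB of decoded samples in RAM
    .shuffle(buffer_size=len(img_path_list))    # full-size buffer, reshuffled every epoch
)

# shuffle=True lets the DataLoader enable the Shuffler DataPipe; with
# num_workers > 0 you would also add a .sharding_filter() step to the pipe.
train_loader = DataLoader(datapipe, batch_size=batch_size, shuffle=True)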

We do not have a DataPipe for prefetching yet; you can create your own or try the built-in prefetching feature of DataLoader.
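For the DataLoader route, a hedged sketch using the MatDataset from the question (the worker count and prefetch_factor are arbitrary example values):

# Each of the num_workers processes keeps prefetch_factor batches loaded ahead of time.
train_loader = DataLoader(
    MatDataset(img_path_list),
    batch_size=batch_size,
    shuffle=True,        # reshuffled every epoch
    num_workers=4,       # background worker processes doing the .mat loading
    prefetch_factor=2,   # batches prefetched per worker
    pin_memory=True,     # speeds up host-to-GPU copies
)

Note that this gives you prefetching and per-epoch shuffling, but not caching: the .mat files are re-read from disk every epoch.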