How to deal with a 200 GB dataset with 1M samples?


I’ve created a huge dataset of 200 GB to train my GNN model. The dataset contains 1M samples.
I would like to use the Dataset and DataLoader classes with it, but the dataset is too large to fit in my memory.
I also tried putting each sample in a separate file to follow this solution, but writing 1M files to my SSD made it crash.

So I split my dataset into 20 files of 10 GB each, which can fit into memory one at a time, but now I’m wondering: how can I load these files using the DataLoader class? Is my approach a good solution, or should I do something else?

NumPy’s memory map lets you read, on demand, just the segments that make up your batch.
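A minimal sketch of that idea, assuming your features were saved as a single flat float32 array on disk (the file name and shapes here are just illustrative):

```python
import numpy as np

# Create a toy on-disk array once (stands in for your real dataset file).
fpath = "features.dat"
arr = np.memmap(fpath, dtype=np.float32, mode="w+", shape=(1000, 8))
arr[:] = np.arange(8000, dtype=np.float32).reshape(1000, 8)
arr.flush()

# Later: map it read-only. Nothing is loaded up front; only the pages
# you actually index get read from disk.
data = np.memmap(fpath, dtype=np.float32, mode="r", shape=(1000, 8))
batch = data[[3, 17, 42]]   # fancy indexing copies just those rows
print(batch.shape)          # (3, 8)
```

Inside a Dataset’s `__getitem__`, indexing the memmap like this gives you per-sample reads without ever holding the full 200 GB in RAM.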


Maybe you could store your data in an HDF5 file, which allows you to read the data chunk by chunk without loading everything into memory. HDF5 also supports compression, so the chunks do not take up your whole SSD. h5py has a nice introduction to it, if you are interested.

In this case, your custom dataset could be something like:

import h5py
import torch
from torch.utils.data import Dataset

class MyDataset(Dataset):
    def __init__(self, hdf5_fpath):
        # only store the path; the file is opened lazily on each access
        self._fpath = hdf5_fpath

    def __len__(self):
        with h5py.File(self._fpath, 'r') as h:
            # a File object has no .shape; ask one of its datasets instead
            return len(h['X'])

    def __getitem__(self, index):
        with h5py.File(self._fpath, 'r') as h:
            X = h['X'][index]
            y = h['y'][index]

        return X, y

which can be used to create a DataLoader as usual.
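Put together, a self-contained version might look like this (the toy file and its sizes are just for illustration; in practice the path points at your 200 GB file):

```python
import h5py
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader

# Build a small HDF5 file so the snippet runs end to end.
with h5py.File("toy.h5", "w") as h:
    h.create_dataset("X", data=np.random.rand(64, 16).astype("f4"))
    h.create_dataset("y", data=np.random.randint(0, 2, size=64))

class MyDataset(Dataset):
    def __init__(self, hdf5_fpath):
        self._fpath = hdf5_fpath

    def __len__(self):
        with h5py.File(self._fpath, "r") as h:
            return len(h["X"])

    def __getitem__(self, index):
        with h5py.File(self._fpath, "r") as h:
            return h["X"][index], h["y"][index]

loader = DataLoader(MyDataset("toy.h5"), batch_size=8, shuffle=True)
Xb, yb = next(iter(loader))
print(Xb.shape, yb.shape)  # torch.Size([8, 16]) torch.Size([8])
```

Opening the file inside `__getitem__` (rather than once in `__init__`) also keeps the dataset safe to use with multiple DataLoader workers, since each access gets its own file handle.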
