How to deal with a 200 GB dataset with 1M samples?


I’ve created a huge dataset of 200 GB to train my GNN model. The dataset contains 1M samples.
I would like to use the Dataset and DataLoader classes with it, but the dataset is too large to fit in my memory.
I also tried putting each sample in a separate file to follow this solution, but writing 1M files to my SSD made it crash.

So I split my dataset into 20 files of 10 GB each, which can fit into memory one at a time, but now I’m wondering: how can I load these files using the DataLoader class? Is my approach a good solution, or should I do something else?

NumPy’s memory map lets you read, on demand, just the segments that make up your batch.
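A minimal sketch of that idea, assuming your features were saved as a single flat float32 array on disk (the file name and shapes here are just illustrative):

```python
import numpy as np

# Create a toy on-disk array once (stands in for your real dataset file).
fpath = "features.dat"
arr = np.memmap(fpath, dtype=np.float32, mode="w+", shape=(1000, 8))
arr[:] = np.arange(8000, dtype=np.float32).reshape(1000, 8)
arr.flush()

# Later: map it read-only. Nothing is loaded up front; only the pages
# you actually index get read from disk.
data = np.memmap(fpath, dtype=np.float32, mode="r", shape=(1000, 8))
batch = data[[3, 17, 42]]   # fancy indexing copies just those rows
print(batch.shape)          # (3, 8)
```

Inside a Dataset’s `__getitem__`, indexing the memmap like this gives you per-sample reads without ever holding the full 200 GB in RAM.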


Maybe you could store your data in an HDF5 file, which allows you to read the data chunk by chunk without loading everything into memory. HDF5 also supports compression, so the chunks do not take up your whole SSD. h5py has a nice introduction to it, if you are interested.

In this case, your custom dataset could be something like:

import h5py
import torch
from torch.utils.data import Dataset

class MyDataset(Dataset):
    def __init__(self, hdf5_fpath):
        # only store the path; the file is opened lazily on each access
        self._fpath = hdf5_fpath

    def __len__(self):
        with h5py.File(self._fpath, 'r') as h:
            # a File object has no .shape; ask one of its datasets instead
            return len(h['X'])

    def __getitem__(self, index):
        with h5py.File(self._fpath, 'r') as h:
            X = h['X'][index]
            y = h['y'][index]

        return X, y

which can be used to create a DataLoader as usual.
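Put together, a self-contained version might look like this (the toy file and its sizes are just for illustration; in practice the path points at your 200 GB file):

```python
import h5py
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader

# Build a small HDF5 file so the snippet runs end to end.
with h5py.File("toy.h5", "w") as h:
    h.create_dataset("X", data=np.random.rand(64, 16).astype("f4"))
    h.create_dataset("y", data=np.random.randint(0, 2, size=64))

class MyDataset(Dataset):
    def __init__(self, hdf5_fpath):
        self._fpath = hdf5_fpath

    def __len__(self):
        with h5py.File(self._fpath, "r") as h:
            return len(h["X"])

    def __getitem__(self, index):
        with h5py.File(self._fpath, "r") as h:
            return h["X"][index], h["y"][index]

loader = DataLoader(MyDataset("toy.h5"), batch_size=8, shuffle=True)
Xb, yb = next(iter(loader))
print(Xb.shape, yb.shape)  # torch.Size([8, 16]) torch.Size([8])
```

Opening the file inside `__getitem__` (rather than once in `__init__`) also keeps the dataset safe to use with multiple DataLoader workers, since each access gets its own file handle.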
