Is it wise to put 10GB of data in a Dataset?

I have a large dataset (10GB total), and each sample is about 100MB. I want to know whether preloading all of the data is a wise approach.

I tried loading everything first and storing it as an attribute of the Dataset, so that __getitem__ only slices out a piece of it and applies data augmentation.

Since the data augmentation is time-consuming, I used multiple workers in the DataLoader, but I found that it becomes very slow.
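
For reference, here is a minimal sketch of the setup I am describing (the class name and the placeholder augmentation are illustrative, not my actual code):

```python
import torch
from torch.utils.data import Dataset, DataLoader

class PreloadedDataset(Dataset):
    """Everything is loaded up front and kept as an attribute of the Dataset."""
    def __init__(self, num_samples, augment):
        # Stand-in for loading the real files; in the actual setup each
        # sample is ~100MB, so this list holds the full ~10GB in RAM.
        self.data = [torch.randn(1000, 1000) for _ in range(num_samples)]
        self.augment = augment

    def __len__(self):
        return len(self.data)

    def __getitem__(self, i):
        sample = self.data[i]        # slice out one preloaded sample
        return self.augment(sample)  # time-consuming augmentation happens here

def augment(x):
    # Placeholder for the real augmentation pipeline
    return x + 0.01 * torch.randn_like(x)

if __name__ == "__main__":
    # Multiple worker processes run __getitem__ (and the augmentation) in parallel
    loader = DataLoader(PreloadedDataset(100, augment), batch_size=4, num_workers=8)
    for batch in loader:
        pass  # training step would go here
```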

Do those sub-processes share the Dataset's memory, or is the dataset copied into each sub-process?

Thank you

Storing 10GB in memory is never a good idea. I recommend writing the samples to CSV files (or a similar format) and then reading each sample back from its file inside __getitem__(self, i).
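
A minimal sketch of that lazy-loading pattern, assuming the samples have already been written out one file per sample (the class name and the CSV reader are just examples):

```python
import numpy as np
import torch
from torch.utils.data import Dataset

class LazyCSVDataset(Dataset):
    """Keeps only the file paths in memory and reads one sample per __getitem__ call."""
    def __init__(self, csv_paths, augment=None):
        self.csv_paths = list(csv_paths)  # a list of paths is tiny compared to 10GB
        self.augment = augment

    def __len__(self):
        return len(self.csv_paths)

    def __getitem__(self, i):
        # Read a single ~100MB sample from disk on demand
        sample = torch.from_numpy(np.loadtxt(self.csv_paths[i], delimiter=",")).float()
        if self.augment is not None:
            sample = self.augment(sample)
        return sample
```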

I would also load the data lazily, as @string111 suggested. Especially in the beginning, when you are experimenting with different hyperparameters, it can take a lot of time to load the whole dataset only to realize after a few iterations that your model won't learn anything.
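
For that experimentation phase, one option is to wrap the dataset in torch.utils.data.Subset so you only iterate over a handful of samples (the stand-in dataset below is just for illustration):

```python
import torch
from torch.utils.data import TensorDataset, Subset, DataLoader

# Stand-in dataset; in practice this would be the lazy-loading Dataset from above
full_dataset = TensorDataset(torch.randn(1000, 10))

# Debug the model and hyperparameters on the first 50 samples
# before committing to passes over the full 10GB
debug_loader = DataLoader(Subset(full_dataset, range(50)), batch_size=4, shuffle=True)
```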