The Data loading tutorial gives you a good overview of how to write a custom Dataset and use the DataLoader with it.
In the default use case, the Dataset loads and processes a single sample in its __getitem__ method using the passed index, and initializes e.g. the data paths in its __init__ method.
Since your files already store multiple samples each, you could still use this lazy loading approach and preload the next file once the current one doesn’t contain enough samples anymore.
The workflow would thus be:
load file0 with 50000 samples and keep it as an attribute
create batches of data from this file until it’s empty or the remaining number of samples is smaller than the batch size
load the next file and repeat until all files have been used.
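The workflow above could be sketched roughly as follows. This is a minimal illustration, not the tutorial's code: the file list, the loader callable, and the fixed samples-per-file count are all assumptions, and in a real project the class would subclass torch.utils.data.Dataset.

```python
class ChunkedDataset:
    """Sketch of the sequential chunked-loading idea: keep one file's
    samples as an attribute and only load the next file when the index
    moves past the current one. Assumes every file holds the same
    number of samples (hypothetical simplification)."""

    def __init__(self, file_paths, load_fn, samples_per_file):
        self.file_paths = file_paths
        self.load_fn = load_fn            # e.g. np.load or torch.load
        self.samples_per_file = samples_per_file
        self.current_file_idx = None      # no file loaded yet
        self.current_data = None

    def __len__(self):
        return len(self.file_paths) * self.samples_per_file

    def __getitem__(self, index):
        # Map the global index to (file, offset within that file).
        file_idx, offset = divmod(index, self.samples_per_file)
        if file_idx != self.current_file_idx:
            # Swap in the next file only when the current one is exhausted.
            self.current_data = self.load_fn(self.file_paths[file_idx])
            self.current_file_idx = file_idx
        return self.current_data[offset]
```

With sequential indices (the DataLoader's default sampler without shuffling), each file is loaded exactly once.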
The shortcoming of this approach is that you wouldn’t be able to easily shuffle the data.
I.e. if you are using the passed index to decide when to load the next file, a shuffled index could trigger constant file swaps, which would yield very bad performance. However, once a single file is loaded, you could create a lookup table with shuffled indices to at least shuffle the samples within each file.
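One way to build such a lookup table is to permute the indices inside each file's range while keeping the files themselves in order, so sequential iteration still touches each file only once. A small sketch (the helper name and fixed samples-per-file count are assumptions):

```python
import random


def make_per_file_lookup(num_files, samples_per_file, seed=0):
    """Return a lookup table that shuffles sample order *within* each
    file but keeps the files in sequential order, avoiding file swaps."""
    rng = random.Random(seed)
    lookup = []
    for f in range(num_files):
        # Indices belonging to file f, shuffled among themselves only.
        indices = list(range(f * samples_per_file, (f + 1) * samples_per_file))
        rng.shuffle(indices)
        lookup.extend(indices)
    return lookup
```

The Dataset's __getitem__ would then translate the incoming index through this table (e.g. `real_index = lookup[index]`) before deciding which file and offset to use.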
Thank you for your reply! I read the tutorial and I have a question.
I also have a CSV file containing, in each row, the path to an image and its label.
In the tutorial they say:
We will read the csv in __init__ but leave the reading of images to __getitem__. This is memory efficient because all the images are not stored in the memory at once but read as required.
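The approach the tutorial describes could look roughly like this. This is a hedged sketch, not the tutorial's actual code: the image loader is passed in as a hypothetical callable (in practice something like PIL.Image.open), and the CSV is assumed to have two columns, path and label.

```python
import csv


class CsvImageDataset:
    """Reads the CSV once in __init__ and loads each image lazily in
    __getitem__, so only the rows (paths and labels) live in memory."""

    def __init__(self, csv_path, load_image):
        # Cheap: only the path/label pairs are kept in memory.
        with open(csv_path, newline="") as f:
            self.rows = [(path, label) for path, label in csv.reader(f)]
        self.load_image = load_image  # hypothetical, e.g. PIL.Image.open

    def __len__(self):
        return len(self.rows)

    def __getitem__(self, index):
        # Expensive work happens here, one sample at a time.
        path, label = self.rows[index]
        return self.load_image(path), label
```

Whether this beats loading whole multi-sample files depends on your data: per-image files favor this lazy approach, while a few large files favor the chunked loading described above.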
Do you think it’s better (in terms of memory and performance) for me to use the tutorial approach instead of loading whole files?