Using DataLoader when all data is stored in memory

Hello! This is my first post on this forum.

I wanted to ask whether you see any benefit in using DataLoader for parallel batch loading if all the data is already stored in memory. Is the data actually prepared in advance, or does the splitting into batches still happen sequentially in the main loop, with only the data loading handed off to separate workers?

Each worker in the DataLoader will create a batch by calling Dataset.__getitem__ for each index in its assigned batch, collating the samples, and adding the finished batch to a queue.
Whether a DataLoader (with multiple workers) would yield a speedup depends on your use case.
E.g. if you are preloading the entire dataset, are just returning the sample in __getitem__, and don't need a custom collate_fn or a specific sampler, the DataLoader's overhead might be visible, and your use case might be faster if you index the Dataset directly with (random) indices.
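A minimal sketch of what "directly indexing the Dataset" could look like, assuming the data is preloaded as tensors (the tensor shapes and batch size here are just illustrative):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical preloaded dataset: everything already sits in memory as tensors
data = torch.randn(1000, 3, 32, 32)
targets = torch.randint(0, 10, (1000,))
dataset = TensorDataset(data, targets)

batch_size = 64

# Option 1: direct indexing with shuffled indices, no DataLoader involved.
# Advanced indexing slices a whole batch out of the preloaded tensors at once.
perm = torch.randperm(len(dataset))
for start in range(0, len(dataset), batch_size):
    idx = perm[start:start + batch_size]
    x, y = data[idx], targets[idx]
    # ... forward/backward pass with x, y ...

# Option 2: the equivalent DataLoader loop, which calls __getitem__
# once per sample and collates the samples into a batch.
loader = DataLoader(dataset, batch_size=batch_size, shuffle=True, num_workers=0)
for x, y in loader:
    pass  # ... forward/backward pass ...
```

Both loops yield the same batch shapes; the direct-indexing version simply skips the per-sample `__getitem__` calls and the collate step.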
However, if you are applying transformations to each preloaded sample, the DataLoader might still be beneficial, especially if the augmentation can run in background workers, so you would need to profile both approaches.
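Profiling could look roughly like the following sketch. The `AugmentedDataset` and the noise transform are hypothetical stand-ins for whatever per-sample work your real pipeline does, and the timings will of course depend on your hardware and transform cost:

```python
import time
import torch
from torch.utils.data import DataLoader, TensorDataset

data = torch.randn(512, 128)
targets = torch.randint(0, 10, (512,))

class AugmentedDataset(TensorDataset):
    # Hypothetical per-sample transform simulating augmentation cost
    def __getitem__(self, index):
        x, y = super().__getitem__(index)
        return x + 0.01 * torch.randn_like(x), y

dataset = AugmentedDataset(data, targets)

def time_epoch(num_workers):
    # Time one full pass over the dataset with the given worker count
    loader = DataLoader(dataset, batch_size=64, shuffle=True,
                        num_workers=num_workers)
    start = time.perf_counter()
    for _ in loader:
        pass
    return time.perf_counter() - start

# Compare e.g. num_workers=0 vs. num_workers=4 on your machine
elapsed = time_epoch(0)
print(f"num_workers=0: {elapsed:.4f}s")
```

Whether extra workers help here depends on how expensive the transform is relative to the inter-process communication overhead, which is why measuring on your actual data is the only reliable answer.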
