Are dictionaries/NumPy arrays safe in torch Datasets?

Hey,

It may be a trivial question, but I could not find a clear answer online.

I’m working with datasets of large images (loading each image takes about a second), but the whole dataset fits in memory, so I have decided to store the images in RAM instead of reading them from disk on the fly.

I have my custom dataset class; so far I’m storing the raw images in a Manager.dict(), and in __getitem__ I do augmentations/processing in NumPy on these images before returning a tensor (a rough sketch follows the questions below). So my questions are:

  1. Can I use a simple Python dictionary to store these images? (Meaning: is it read/write safe when accessed through a DataLoader with multiple workers?)
  2. Same question for a large NumPy array (which would contain all the images)?
  3. Do you think it’s faster to load all images once at instantiation of the dataset, or to load and store them in RAM during the first call to __getitem__ for each sample?
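
Roughly, my current setup looks like this; the _load body and the augmentation step are simplified placeholders, not my real code:

```python
import numpy as np
import torch
from multiprocessing import Manager
from torch.utils.data import Dataset

class InMemoryImageDataset(Dataset):
    """Current setup: raw images pre-loaded into a Manager.dict()."""

    def __init__(self, image_paths):
        self.image_paths = image_paths
        self._manager = Manager()  # keep a reference so the manager process stays alive
        self.images = self._manager.dict()
        # The slow image loading happens once, up front.
        for idx, path in enumerate(image_paths):
            self.images[idx] = self._load(path)

    def _load(self, path):
        # Placeholder: however the large images are actually read from disk.
        return np.load(path)

    def __len__(self):
        return len(self.image_paths)

    def __getitem__(self, idx):
        img = self.images[idx]
        # Placeholder for the numpy augmentations/processing.
        img = img.astype(np.float32) / 255.0
        return torch.from_numpy(img)
```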

Thank you in advance,

  1. and 2.: since the DataLoader uses multiprocessing, the Dataset will be copied and each worker will work on its own Dataset instance, so there shouldn’t be any read/write conflicts (see the first sketch after this list). However, this also means that your memory usage will increase with the number of workers.

  3. You could compare both approaches, but note that the “lazy loading” will most likely not work with multiple workers, as the data would be cached in a single worker’s replica of the dataset; you would need to use shared arrays/dicts, as given in this example (the second sketch after this list shows the idea). Also, to avoid reloading the complete data, you could pre-load it outside of Dataset.__init__ and just pass it in as an argument; otherwise the dataset (and thus the expensive loading) would be recreated in each epoch.
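
For 1. and 2., a minimal sketch of the plain-dict case, assuming the images are already loaded as numpy arrays before the DataLoader starts:

```python
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader

class PlainDictDataset(Dataset):
    """Images cached in a plain Python dict, fully loaded up front."""

    def __init__(self, images):
        # images: {index: np.ndarray}, already resident in RAM
        self.images = images

    def __len__(self):
        return len(self.images)

    def __getitem__(self, idx):
        # Read-only access is safe: each worker process reads its own copy.
        # A write here would only change this worker's private copy.
        return torch.from_numpy(self.images[idx].astype(np.float32))

if __name__ == "__main__":
    data = {i: np.random.rand(8, 8) for i in range(16)}
    loader = DataLoader(PlainDictDataset(data), batch_size=4, num_workers=2)
    for batch in loader:
        print(batch.shape)  # torch.Size([4, 8, 8])
```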
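
For 3., a rough sketch of a shared lazy cache using tensors moved to shared memory via share_memory_(); the fixed image shape and the _load body are hypothetical placeholders:

```python
import torch
from torch.utils.data import Dataset

class SharedLazyDataset(Dataset):
    """Lazy cache in shared memory: an image loaded by one worker stays
    visible to all worker processes afterwards."""

    def __init__(self, image_paths, img_shape=(3, 512, 512)):
        self.image_paths = image_paths
        # Created (and moved to shared memory) before the DataLoader
        # starts its workers, so all processes see the same buffers.
        self.cache = torch.zeros(len(image_paths), *img_shape).share_memory_()
        self.cached = torch.zeros(len(image_paths), dtype=torch.bool).share_memory_()

    def _load(self, path):
        # Placeholder for the actual slow image loading/decoding.
        return torch.rand(self.cache.shape[1:])

    def __len__(self):
        return len(self.image_paths)

    def __getitem__(self, idx):
        if not self.cached[idx]:
            # Each index is assigned to a single worker per epoch, so
            # these per-index writes don't race in practice.
            self.cache[idx] = self._load(self.image_paths[idx])
            self.cached[idx] = True
        return self.cache[idx].clone()
```

The simpler alternative mentioned above is to load the dict once outside the Dataset and pass it into the constructor, so recreating the dataset doesn’t reload anything.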
