Handling a big dataset with limited memory resources

Torcione · August 5, 2022, 9:45am

Hi everyone,
I’m trying to run my experiments on the iNaturalist dataset for the first time.
It is one of the built-in datasets available in PyTorch (INaturalist).
When execution reaches the initialization of the dataset, the code hangs for a while; eventually it stops and I get this message:

/usr/lib/python3.8/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 6 leaked semaphore objects to clean up at shutdown

I read that a possible cause is that torchvision.datasets.INaturalist tries to load the whole dataset into memory during the init call (eager loading) instead of loading samples lazily.
My questions are:
Is my memory being saturated because torchvision.datasets.INaturalist performs eager loading?
If so, is it possible to easily modify the built-in class to switch to lazy loading, or should I define a new class to load the dataset? How should I proceed?
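
To make the eager vs. lazy distinction concrete, here is a minimal sketch of the kind of lazy-loading class I have in mind. It is not the torchvision implementation: the class name LazyImageFolder, the root/<class>/<file> directory layout, and the loader parameter are all assumptions for illustration. The idea is that the init call only builds a list of file paths, and each file is read only when its index is requested.

```python
import os

try:
    from torch.utils.data import Dataset
except ImportError:  # fallback so the sketch also runs without PyTorch installed
    Dataset = object


class LazyImageFolder(Dataset):
    """Hypothetical lazy dataset: __init__ indexes paths, __getitem__ reads one file."""

    def __init__(self, root, loader=None, transform=None):
        self.root = root
        self.transform = transform
        # `loader` is an assumed hook: any callable mapping a file path to a sample.
        self.loader = loader if loader is not None else self._read_bytes
        # Assumed layout: one sub-directory per class under `root`.
        self.classes = sorted(
            d for d in os.listdir(root) if os.path.isdir(os.path.join(root, d))
        )
        # Only (path, label) pairs are kept in memory, never the file contents.
        self.samples = []
        for label, cls in enumerate(self.classes):
            cls_dir = os.path.join(root, cls)
            for fname in sorted(os.listdir(cls_dir)):
                self.samples.append((os.path.join(cls_dir, fname), label))

    @staticmethod
    def _read_bytes(path):
        # Default loader: return the raw bytes of the file.
        with open(path, "rb") as f:
            return f.read()

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        # The file is opened here, one sample at a time.
        path, label = self.samples[idx]
        sample = self.loader(path)
        if self.transform is not None:
            sample = self.transform(sample)
        return sample, label
```

With Pillow installed, an image loader such as `lambda p: Image.open(p).convert("RGB")` could be passed as `loader`, and the object could then be handed to a `torch.utils.data.DataLoader`. Memory use at init time should then grow with the number of file paths rather than with the size of the images.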