A clear explanation of what num_workers=0 means for a DataLoader

Hello,

The PyTorch documentation says that setting num_workers=0 for a DataLoader causes the data loading to be handled by the “main process”. From the PyTorch doc:

" 0 means that the data will be loaded in the main process."

Maybe I’m wrong, but I often find that the PyTorch docs give a lot of obvious information (not always, of course) while leaving out exactly the points I’m looking for… and it’s quite frustrating at times.

… by reading a bit more elsewhere, that seems to mean no multiprocessing, i.e. single-process loading. OK. But does it also mean that the whole dataset (all the minibatches) is loaded into main memory (RAM)?

Can you confirm whether this is the case? This is actually what I want, because I have a dataset of around 1 million files, which is extremely slow to load even with a 16-worker DataLoader, and I can afford to keep the whole dataset (around 32 GB) in RAM so that the minibatches can be fetched quickly after that.

From what I read elsewhere, it seems to be the case that 0 means loaded into RAM, but it would be great if the PyTorch experts on this forum could confirm.

P.S.
Setting pin_memory=True and e.g. prefetch_factor=32 does not help, so ideally, since I can afford it, I would like the whole dataset loaded into RAM but “transparently”, i.e. if just setting num_workers=0 did that, it would be the dream.

The DataLoader does not decide how and when the actual samples are loaded; that is defined in the Dataset. Depending on how Dataset.__init__ and .__getitem__ are implemented, the entire internal data could be preloaded or each sample could be loaded lazily.

The DataLoader will call into Dataset.__getitem__ to load each sample and will create the batch using the collate_fn. Each worker will create a copy of the Dataset (assuming you are using num_workers>0).
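
To make the difference concrete, here is a minimal sketch (with a hypothetical folder of files readable via torch.load; substitute your own file parsing and preprocessing) of a lazily loading Dataset next to one that preloads everything in __init__:

import os
import torch
from torch.utils.data import Dataset

class LazyFileDataset(Dataset):
    # Reads one file per __getitem__ call; nothing is kept in RAM between accesses.
    def __init__(self, root):
        self.paths = sorted(os.path.join(root, f) for f in os.listdir(root))

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        # Disk I/O and preprocessing happen here, on every access.
        return torch.load(self.paths[idx])

class PreloadedFileDataset(Dataset):
    # Reads every file once in __init__ and keeps the results in RAM.
    def __init__(self, root):
        paths = sorted(os.path.join(root, f) for f in os.listdir(root))
        self.samples = [torch.load(p) for p in paths]

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        # Pure indexing into memory; no disk I/O here.
        return self.samples[idx]

Both versions work with any num_workers setting; only the Dataset decides whether the data lives in RAM.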

Whether the whole dataset ends up in RAM is defined by the Dataset, as mentioned above.

No, num_workers=0 does not mean the data is loaded into RAM; that’s not the case.

Improvements to the docs are always welcome, so feel free to submit fixes or missing information.


Oh, thanks for the very fast answer. I did implement __init__ and __getitem__: I list all the files from a folder, and __getitem__ returns a minibatch that I build from the contents of one file…
I was under the impression (but misled by a certain GPT!) that setting num_workers=0 would “magically”, i.e. internally, cause the DataLoader to preload the whole dataset into RAM.

ChatGPT claims that:

from torch.utils.data import DataLoader
from torchvision.datasets import MNIST
from torchvision.transforms import ToTensor

train_data = MNIST('data/', train=True, download=True, transform=ToTensor())
train_loader = DataLoader(train_data, batch_size=32, shuffle=True, num_workers=0)

# All minibatches are loaded into RAM before training starts
for batch in train_loader:
    # Do something with the batch
    pass

So GPT is talking nonsense (again)? XD

(I could load it all manually, it’s true, but each file is 1 MB, so I’d need to open it, do some preprocessing, extract the data, etc… i.e. more or less what is done in __getitem__… and store it in a variable, I guess… but that would require some more code to write, and I hoped, maybe just in my (ChatGPT’s) dreams, that num_workers=0 would do it.)

The comment is true in this case, since the MNIST dataset preloads all samples, as seen here. __getitem__ then indexes this preloaded data here and processes it afterwards.
You can also access the preloaded data via:

train_data.data.shape
# torch.Size([60000, 28, 28])

However, this behavior is not defined by the DataLoader or its number of workers, but by the Dataset (MNIST in this case).
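
One quick way to see that this is a property of the Dataset rather than of the DataLoader: the tensor is already in memory right after constructing the Dataset, before any DataLoader (or num_workers setting) exists:

from torchvision.datasets import MNIST
from torchvision.transforms import ToTensor

# No DataLoader has been created yet; MNIST.__init__ has already read
# the images into a tensor held in RAM.
train_data = MNIST('data/', train=True, download=True, transform=ToTensor())
print(train_data.data.shape)  # torch.Size([60000, 28, 28])
print(train_data.data.dtype)  # torch.uint8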


Oh I see, of course, that’s the dummy MNIST example… I have my own train_data that I wrote, of course.

The funny thing is that with e.g. 350k files the performance is only slow for the first epoch; after that the data seems to be in RAM and everything is much faster in the following epochs (this is the case even without specifying pin_memory or prefetch_factor). However, with around 700k files it stays slow even after epochs 0, 1, 2… Do you have an idea why? Is there some “threshold” around 700k files that is just too much to handle? I was under the impression that after the first epoch, if all the minibatches fit in RAM, they should stay there, which seems to be the case with 350k but not 700k files.

I also don’t know where this understanding comes from, but the DataLoader is not caching the data in any way. It would still be your responsibility to write a caching mechanism into your custom Dataset.
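
As a rough sketch (hypothetical names, and assuming each file maps to one item returned by __getitem__ and the whole dataset fits in RAM), such a cache could look like:

import os
import torch
from torch.utils.data import Dataset

class CachingFileDataset(Dataset):
    # Reads each file from disk at most once and keeps the result in RAM.
    def __init__(self, root):
        self.paths = sorted(os.path.join(root, f) for f in os.listdir(root))
        self._cache = {}  # index -> preprocessed sample

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        if idx not in self._cache:
            # First access: read and preprocess the file, then store it.
            self._cache[idx] = torch.load(self.paths[idx])
        return self._cache[idx]

Note that with num_workers>0 each worker process holds its own copy of the Dataset, so this dict is only filled per worker; with num_workers=0 there is a single cache in the main process.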

Well, I said that because I noticed it on my computer: first epoch → very slow, second through last epochs → much faster… so clearly some type of caching was happening… but what triggered it?

I mean, of course I’m sure you are right about the DataLoader, so the “sort of caching mechanism” that I noticed must come from somewhere else… Do you think it could be a “smart feature” of the OS (Ubuntu in my case)?!

Yes, your OS could be caching the data locally (the Linux page cache), and I would also guess the cache might still be valid after the initial run (i.e. even if you rerun the same Python script).
To clear the cache you could run /sbin/sysctl vm.drop_caches=3 (as root) between Python script executions or epochs.