What are the (dis)advantages of persistent_workers?

Hello,

What does persistent_workers do? What are the implications for I/O speed and RAM consumption?

4 Likes

Hi,

With this option set to False, every time your code hits a line like `for sample in dataloader:`, it will create a brand new set of workers to do the loading and will kill them on exit.
This means that if you have multiple dataloaders, each one’s workers are killed as soon as you are done iterating over it.

If you make them persist, these workers will stay around (with their state) waiting for another call into that dataloader.

Setting this to True will improve performance when you call into the dataloader multiple times in a row (as creating the workers is expensive). But it also means that the dataloader will keep some persistent state even when it is not used (which can consume RAM, depending on your dataset).
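
A minimal sketch of the difference (dataset and sizes made up for illustration):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(1000, 16), torch.randint(0, 10, (1000,)))

# persistent_workers=False (the default): a fresh set of workers is spawned
# every time a `for batch in loader:` loop starts, and killed when it ends.
loader = DataLoader(dataset, batch_size=32, num_workers=4)

# persistent_workers=True: the workers (and their copies of the dataset)
# are created once, on the first iteration, and kept alive between epochs.
persistent_loader = DataLoader(
    dataset, batch_size=32, num_workers=4, persistent_workers=True
)

for epoch in range(3):
    for batch in persistent_loader:  # no worker re-spawn after epoch 0
        pass
```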

5 Likes

Some follow-up questions:
Does setting persistent_workers to True cancel the re-shuffling of the dataloader each epoch?

Specific to my case:
I’m running a heavy training protocol (big 3D input samples), but my data loading is quite straightforward, one dataloader for training and one for validation. They are re-initialized (with `for sample in dataloader:`) every epoch, and I notice this takes some time, sometimes up to 10 seconds (when run in the debugger). Do you think it is correct to set persistence to True in this case?

1 Like

No, it should not change how randomness is done.

If the persistent state is not an issue for you, you should definitely enable it, yes.
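
A quick toy check that shuffling still changes every epoch (one batch covering the whole dataset, so the print shows the full order):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.arange(8))
loader = DataLoader(dataset, batch_size=8, shuffle=True,
                    num_workers=2, persistent_workers=True)

for epoch in range(3):
    for (indices,) in loader:
        # The order is reshuffled on every pass, even with persistent workers.
        print(f"epoch {epoch}: {indices.tolist()}")
```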

1 Like

Could you give an example of a situation in which such persistent dataloader state becomes an issue? In my case, I’m preloading my entire dataset into RAM and using `enumerate(dataloader)` to extract mini-batches.
Would it then be wise to use the persistent-workers option?

1 Like

It will be an issue if you use the dataloader only once, or if you have a large number of them that you rotate through and don’t want all of their state in memory at the same time.
But if you have a single one, as in your case, you should definitely use it, yes.

5 Likes

I can imagine that if you just have a preprocessed dataset where you only need to access its `__getitem__` from memory, one would rather set `num_workers=0` and `persistent_workers=False`. The overhead of multiple workers will not contribute anything in that case, because no processing needs to be done except for collating, which I am not sure is very expensive.

Does that make sense, or am I wrong?

1 Like

If you’re using `num_workers=0`, there are no worker processes, so the persistent_workers flag will have no effect at all 🙂
But indeed, if your dataset is completely in memory and you don’t have heavy preprocessing, then you can use 0 workers and it will run just fine.
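
For instance (made-up, fully in-memory data):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Already preprocessed and living in RAM, so there is no per-sample work
# worth offloading to worker processes.
features = torch.randn(10_000, 64)
labels = torch.randint(0, 2, (10_000,))

loader = DataLoader(TensorDataset(features, labels),
                    batch_size=256, shuffle=True, num_workers=0)

for x, y in loader:  # collation is just stacking in-memory tensors
    pass
```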

3 Likes

@albanD , is it possible to shut down the workers later on? I have a situation where I want persistent workers, but after I am done with dataloader X, I want all of its footprint to go away. I am not sure whether `del dataloader` fixes it. Any ideas? Thanks!

1 Like

Yes, it is possible. When you have persistent workers enabled, you can `del dl._iterator` or call `dl._iterator._shutdown_workers()` to remove them from your process.
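
For example (keeping in mind that `_iterator` and `_shutdown_workers()` are private internals and may change between PyTorch versions):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(100, 8))
dl = DataLoader(dataset, batch_size=10, num_workers=2,
                persistent_workers=True)

for batch in dl:
    pass  # the workers stay alive after this loop ends

# Tear the workers down explicitly, then clear the reference so the
# loader builds a fresh iterator (and fresh workers) on its next use.
dl._iterator._shutdown_workers()
dl._iterator = None
```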

2 Likes

@albanD , thank you for answering the thread. I’m especially glad to hear that persistent_workers allows workers to stick around while iterating over the dataset (e.g. `for image, target in dataloader`). From the documentation, I was under the impression that workers stick around only until the entire dataset has been consumed once:

        persistent_workers (bool, optional): If ``True``, the data loader will not shutdown
            the worker processes after a dataset has been consumed once. This allows to
            maintain the workers `Dataset` instances alive. (default: ``False``)

How can I enable persistent_workers but still be able to change a worker’s attributes?
For example, my dataset has an attribute called epoch, which tracks the current epoch; I update this attribute during training.
But since I set persistent_workers to True, each worker keeps the same dataset copy the whole time, with epoch stuck at its initialization value. How can I change this behavior without setting persistent_workers to False?

In case anybody has the same situation as I do, here is the solution: Change worker process's attribute while set persistent_worker = True · Issue #97326 · pytorch/pytorch · GitHub
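
For reference, here is one possible workaround, sketched as my own illustration (not necessarily the exact approach from the issue): keep the mutable state in a shared-memory tensor, so the dataset copies living inside the persistent workers see updates made by the main process.

```python
import torch
from torch.utils.data import DataLoader, Dataset

class EpochAwareDataset(Dataset):
    def __init__(self, n):
        self.n = n
        # One shared-memory cell, visible to the main process and all workers.
        self.epoch = torch.zeros(1, dtype=torch.long).share_memory_()

    def __len__(self):
        return self.n

    def __getitem__(self, idx):
        return idx, int(self.epoch[0])  # always the up-to-date epoch

dataset = EpochAwareDataset(100)
loader = DataLoader(dataset, batch_size=10, num_workers=2,
                    persistent_workers=True)

for epoch in range(3):
    dataset.epoch[0] = epoch  # visible inside the persistent workers
    for idx, seen_epoch in loader:
        assert (seen_epoch == epoch).all()
```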

Just wanted to throw out my own use case and why it is valuable (and a time-saver) for me to actually have persistent_workers set to False.

The reason is that I have an “active learning” process where the dataset labels can change every epoch.

All of the samples themselves are cached locally in shared memory, but the LABELS are cleaned up as active learning proceeds, so these labels need to be pulled down from an external database. This pull of the new labels happens once per epoch and is a relatively fast operation, a single call to the database. It is triggered by calling a function on the Dataset owned by the DataLoader.

By setting persistent_workers to False, I guarantee that at the start of every epoch a NEW instance of the Dataset, based on the DataLoader’s current instance and carrying the updated labels from the database, is pickled out to every worker.

This would not be possible with persistent_workers set to True (at least without digging into the internals, which I’d rather not do).
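
A minimal sketch of this pattern (the dataset, the label-fetching function, and the sizes are made-up placeholders):

```python
import torch
from torch.utils.data import DataLoader, Dataset

def fetch_labels_from_database(n):
    # Hypothetical stand-in for the once-per-epoch database call.
    return torch.randint(0, 2, (n,))

class ActiveLearningDataset(Dataset):
    def __init__(self, samples):
        self.samples = samples  # cached locally, never changes
        self.labels = torch.zeros(len(samples), dtype=torch.long)

    def refresh_labels(self):
        self.labels = fetch_labels_from_database(len(self.samples))

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        return self.samples[idx], self.labels[idx]

dataset = ActiveLearningDataset(torch.randn(1000, 8))
loader = DataLoader(dataset, batch_size=32, num_workers=4,
                    persistent_workers=False)  # the default

for epoch in range(3):
    dataset.refresh_labels()  # update the main-process copy
    for x, y in loader:       # fresh workers pickle the updated dataset
        pass
```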