Selecting num_workers is pretty tricky. As I slowly migrated to PyTorch Lightning, I noticed it gives you a warning suggesting a suitable num_workers based on your hardware and data. But in plain PyTorch, as of now, I think it's trial and error.
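For anyone new to this, num_workers is just an argument to DataLoader; here is a minimal sketch of the knob being discussed (the toy dataset and batch size are only placeholders):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy in-memory dataset standing in for whatever dataset you actually use.
dataset = TensorDataset(torch.randn(1000, 3, 32, 32), torch.randint(0, 10, (1000,)))

# num_workers=0 loads batches in the main process;
# num_workers>0 spawns that many worker processes for loading.
loader = DataLoader(dataset, batch_size=64, shuffle=True, num_workers=4)
```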
For me, increasing num_workers reduces the data loading time per batch, but it also occasionally slows things down so much that, e.g. over 100 batches, it is slower than with num_workers=0. I haven't figured out what causes these hiccups.
I’ve also been experiencing the same issue for a while. I’m not even sure I ever truly benefited from multiple workers, since I noticed this problem rather late. I have 8 cores and have tried running with 0, 1, 2, …, 8 workers. The main process (0 workers) consistently gave me the fastest loading. This is also the case for data that is already pre-loaded into memory.
Same issue here on PyTorch 1.12.0. PyTorch Lightning throws a PossibleUserWarning and suggests using 8 workers (the number of cores on my M1 CPU), but doing so results in a huge slowdown.
Same behaviour on Windows 10 Pro, PyTorch Lightning 2.0.4, Torch 2.0.0+cu117.
I have 20 cores, and setting num_workers to 20 causes a slowdown of several minutes between each epoch. Setting num_workers to 1 or 2 already gives a much better result, with a slowdown of about 20 seconds. With num_workers at 0 I get by far the best results, with a slowdown of maybe 2-3 seconds at most.
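For the slowdown between epochs specifically, one thing that might be worth checking (just a guess, not a confirmed fix for your setup) is that by default the worker processes are shut down and respawned at the start of every epoch; the persistent_workers argument of DataLoader keeps them alive across epochs. A minimal sketch:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder in-memory dataset.
dataset = TensorDataset(torch.randn(512, 10), torch.randint(0, 2, (512,)))

# persistent_workers=True (PyTorch >= 1.7) keeps the worker processes alive
# between epochs instead of respawning them at the start of each one.
loader = DataLoader(dataset, batch_size=64, num_workers=2, persistent_workers=True)
```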
Same on Ubuntu 20.04.5 LTS, using PyTorch 2.1.0 and Lightning 2.1.0. Do we have any updates on this? Is there any guideline as to when we should set num_workers > 0?
For people arriving here looking for an answer: the general recommendation of num_workers = number of CPU threads is not valid in many use cases.
In the use case mentioned in this post, since the data is already in memory, I would guess the overhead of spinning up multiple processes makes parallel loading not worthwhile.
Another use case that makes num_workers tricky is 3D data, where heavy concurrent file reading can actually make loading slower than using, for example, a single worker.
In most cases, even if 2 workers is slower, 1 should be better than 0, since with 0 workers data loading competes with the code in your main training loop. However, some image processing functions from data augmentation libraries run faster when called in the main process than in worker subprocesses. In subprocesses they are limited to 100% of one thread, whereas I have only seen the same code use full multithreading in its underlying C implementations when called with num_workers=0. I don't know why that happens.
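If you want to poke at the threading side yourself, here is a rough sketch (it only covers PyTorch's own intra-op thread setting, not whatever internal thread pools an augmentation library manages) that prints the thread count inside each worker via worker_init_fn:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

def report_threads(worker_id):
    # DataLoader workers start with PyTorch's intra-op thread count set to 1,
    # so ops that are multithreaded in the main process run single-threaded here.
    print(f"worker {worker_id}: torch intra-op threads = {torch.get_num_threads()}")

if __name__ == "__main__":  # guard needed for multi-worker loading on Windows/macOS
    dataset = TensorDataset(torch.randn(256, 10))
    loader = DataLoader(dataset, batch_size=32, num_workers=2,
                        worker_init_fn=report_threads)
    for _batch in loader:
        pass
    print(f"main process: torch intra-op threads = {torch.get_num_threads()}")
```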
So, this is a mess and really should be determined through experimentation for each use case. It's way more complicated than it looks!
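If it helps anyone, here is a rough sketch of that experiment (the toy in-memory dataset, batch size, and worker counts are placeholders you would swap for your own data and transforms):

```python
import time
import torch
from torch.utils.data import DataLoader, TensorDataset

def time_loader(dataset, num_workers, batch_size=64, epochs=2):
    loader = DataLoader(dataset, batch_size=batch_size, shuffle=True,
                        num_workers=num_workers)
    start = time.perf_counter()
    for _ in range(epochs):
        for _batch in loader:
            pass  # just iterate; we only care about the data loading cost
    # Timing includes worker startup each epoch, which is part of what you want to measure.
    return time.perf_counter() - start

if __name__ == "__main__":  # guard needed for multi-worker loading on Windows/macOS
    dataset = TensorDataset(torch.randn(2000, 3, 64, 64),
                            torch.randint(0, 10, (2000,)))
    for workers in (0, 1, 2, 4, 8):
        print(f"num_workers={workers}: {time_loader(dataset, workers):.2f} s")
```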