Is there a way to make DataLoader use Python threads instead of multiprocessing?
No, there is currently no way to use multiple threads. As far as I know, a multi-threaded loader would run into the Python GIL, which would hurt performance, so multiprocessing is used instead.
I think there are a couple of catches to the GIL where you can still get good parallel performance using Python threads. I could be wrong, but here is what I believe to be true:
- For compute-bound tasks, the GIL lets only one thread run Python code at a time and interleaves the threads through preemptive multitasking. So if you have N processors, you will not get an N-fold speedup.
- A thread releases the GIL when it enters a wait state on an I/O-bound operation, so other threads can run in the meantime. This is handy if you are reading from a network drive with high latency. That is my situation, which is why I asked.
- Lower-level libraries can release the GIL while they run native code, which effectively removes the issue in the first item. I believe OpenCV does this. So, if my understanding is correct, doing your data augmentation with OpenCV should release the GIL, and using multiple threads should give you a speedup.
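To make the I/O point concrete, here is a minimal sketch of what I mean. It is only an illustration: `fake_read` and `load_all` are names I made up, and `time.sleep` stands in for a slow read from a high-latency drive (sleeping blocks the thread and releases the GIL, much like a blocking read would).

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_read(index, latency=0.05):
    """Hypothetical stand-in for a slow read; sleeping releases the GIL."""
    time.sleep(latency)
    return index * 2


def load_all(indices, num_threads):
    """Read every index with a pool of threads; return (results, seconds)."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        # pool.map returns results in the same order as the input indices.
        results = list(pool.map(fake_read, indices))
    return results, time.perf_counter() - start


serial_results, serial_time = load_all(range(8), num_threads=1)
threaded_results, threaded_time = load_all(range(8), num_threads=8)
print(f"1 thread: {serial_time:.2f}s, 8 threads: {threaded_time:.2f}s")
```

With one thread the waits add up, while with eight threads they overlap, so the threaded run should finish in roughly the time of a single read. This says nothing about compute-bound work, where the first item above still applies.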
I am developing on Windows and have found multiprocessing to be a bit of a headache. For example, different versions of Python copy over different parts of the context when creating a new process. In general, process creation on Windows isn't as slick as it is on Linux.
Anyone out there reading this, please let me know if something I stated above is incorrect.
I am interested in the same topic. If my DataLoader is I/O bound (for example, because I am reading a lot of small arrays from a slow hard drive), I believe I/O performance would improve considerably with more workers than available CPU cores, so a thread-based data loader could make sense.
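For what it's worth, a thread-based loader along those lines could be sketched roughly as below. This is not a PyTorch API: `ThreadedLoader` and `SquaresDataset` are hypothetical names, and the only assumption about the dataset is that it supports `__len__` and `__getitem__`. Batches come back as plain lists, with no collation or shuffling.

```python
from concurrent.futures import ThreadPoolExecutor


class ThreadedLoader:
    """Hypothetical thread-based batch loader (not part of PyTorch)."""

    def __init__(self, dataset, batch_size, num_threads):
        self.dataset = dataset
        self.batch_size = batch_size
        # For I/O-bound reads this can exceed the number of CPU cores.
        self.num_threads = num_threads

    def __iter__(self):
        indices = range(len(self.dataset))
        with ThreadPoolExecutor(max_workers=self.num_threads) as pool:
            for start in range(0, len(indices), self.batch_size):
                batch_indices = indices[start:start + self.batch_size]
                # Items in a batch are fetched concurrently; map keeps order.
                yield list(pool.map(self.dataset.__getitem__, batch_indices))


class SquaresDataset:
    """Toy dataset used only for this example."""

    def __len__(self):
        return 10

    def __getitem__(self, index):
        return index * index


batches = list(ThreadedLoader(SquaresDataset(), batch_size=4, num_threads=16))
print(batches)
```

Whether this actually beats multiple processes would depend on how much of `__getitem__` is waiting on the disk versus running Python code, so it would need to be measured on the real data.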