The extract_features() function performs a variety of operations (including a for loop over the batch size) to construct the inputs of the model for line 6. What I observe is that when I use num_workers=0, the extract_features() method runs fast, but whenever I use any multiprocessing (num_workers > 0), it runs slower. My guess is that extract_features() is not conducive to parallel processing (it wasn't designed for that either).
Therefore I am wondering: is there a way to disable multiprocessing temporarily during the extract_features() step and resume it once that step is finished?
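One common workaround is to run the feature-construction pass with a separate single-process DataLoader (num_workers=0), and keep a multi-process loader for the rest of the pipeline. A minimal sketch, where the dataset and the stand-in extract_features() are hypothetical placeholders for your own:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical stand-in for the poster's extract_features(); the real one
# builds model inputs with a for loop over the batch.
def extract_features(batch):
    x, = batch
    return x * 2.0

dataset = TensorDataset(torch.arange(8, dtype=torch.float32))

# Single-process loader: extract_features() runs in the main process,
# avoiding the multiprocessing overhead described above.
single = DataLoader(dataset, batch_size=4, num_workers=0)
features = [extract_features(b) for b in single]

# A separate DataLoader with num_workers > 0 can still be used for the
# rest of training once the features are precomputed.
```

This trades memory (the precomputed features) for avoiding the per-worker cost, so it fits best when the features are reusable across epochs.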
Is extract_features running on the CPU or another device? If it is on the CPU and using too many threads, try setting the OMP_NUM_THREADS environment variable to 1 (or some other small number of threads) and see if that helps. If it is on another device (e.g., GPU) but is the last step in your data loading/input construction, I would consider simply moving it "downstream", e.g., as part of your model but wrapped in with torch.no_grad(): so that it is not affected by the num_workers parameter.
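A sketch of what "moving it downstream" could look like, assuming a toy model and a hypothetical feature-construction step:

```python
import torch
import torch.nn as nn

class Model(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(4, 2)

    def extract_features(self, x):
        # Hypothetical feature construction, moved out of the DataLoader
        # pipeline so that num_workers no longer affects it.
        return x - x.mean(dim=1, keepdim=True)

    def forward(self, x):
        with torch.no_grad():  # the feature step needs no gradients
            feats = self.extract_features(x)
        return self.linear(feats)

model = Model()
out = model(torch.randn(3, 4))
```

Here the DataLoader only moves raw tensors, and the feature step runs once per batch inside forward().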
How can I know what device extract_features is running on? I would like to leverage multiprocessing with num_workers > 1 to load data faster, so I don't think I want to set OMP_NUM_THREADS=1. I tried moving extract_features inside forward() together with with torch.no_grad(), if that's what you meant by "moving it downstream", but I still observe the same behavior as before.
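To answer the device question: each tensor carries a .device attribute, so inspecting the inputs inside extract_features tells you where the work happens. Tensors created without an explicit .to(...)/.cuda() call live on the CPU, and plain Python loops over CPU tensors run on the CPU:

```python
import torch

x = torch.randn(2, 3)   # created without .to()/.cuda(), so it lives on the CPU
print(x.device)

# Inside extract_features, checking the inputs like this reveals the device:
assert x.device.type == "cpu"
```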
OMP_NUM_THREADS doesn't limit the number of worker processes (which is what num_workers controls) but rather the amount of intra-process parallelism each worker uses, which can reduce contention when more workers are active.
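So the two settings are complementary: you can keep num_workers > 1 for parallel loading while limiting each worker's thread pool. One way to do this per worker, rather than globally via the environment variable, is a worker_init_fn that calls torch.set_num_threads; a sketch, with a hypothetical limit_threads helper:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical worker_init_fn: each DataLoader worker caps its own
# intra-op thread pool, similar to what OMP_NUM_THREADS=1 does globally.
def limit_threads(worker_id):
    torch.set_num_threads(1)

loader = DataLoader(
    TensorDataset(torch.arange(8)),
    batch_size=4,
    num_workers=2,           # multiple loading processes are kept
    worker_init_fn=limit_threads,
)
```

This keeps the main process's thread count untouched while preventing the workers from oversubscribing the CPU.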