Model runs slower when adding more workers


In my model, there is a step where I am extracting features for each timestep of a sequence (line 2) that are then passed to my model in a next step (line 6):

```
 1    state_vector, local_costmap, global_costmap = \
 2        extract_features(grid, curr_pos, start_pos,
 3                         goal_pos, prev_pos, node_list, batch_size, i,
 4                         device, normalize_inputs)
 5
 6    pred, hidden_state = model(state_vector,
 7                               local_costmap,
 8                               global_costmap,
 9                               use_full_teacher_forcing,
10                               hidden=hidden_state)
```

The extract_features() function performs a variety of operations (including a for loop over the batch size) to construct the inputs of the model call on line 6. What I observe is that with num_workers=0 the extract_features() step runs fast, but with any multiprocessing (num_workers > 0) it runs slower. I am guessing that my extract_features() method is not conducive to parallel processing (it wasn’t designed for that, either).
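One way to confirm that the per-call cost of extract_features() itself changes with num_workers is to time it directly in the loop and compare the two settings. A minimal timing sketch (the extract_features here is just a stand-in that sleeps; swap in the real call):

```python
import time

def extract_features():
    # Stand-in for the real feature extraction; sleeps to mimic work.
    time.sleep(0.01)
    return None, None, None

total, calls = 0.0, 0
for i in range(5):  # loop over timesteps, as in the original code
    t0 = time.perf_counter()
    state_vector, local_costmap, global_costmap = extract_features()
    total += time.perf_counter() - t0
    calls += 1

avg_ms = total / calls * 1e3
print(f"extract_features: {avg_ms:.2f} ms/call over {calls} calls")
```

Running this once with num_workers=0 and once with num_workers > 0 in the real training loop would show whether the slowdown is really inside extract_features() or somewhere else in the pipeline.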

Therefore I am wondering: can I disable multiprocessing temporarily during the extract_features() step and resume it once it is finished?

Is extract_features running on the CPU or on another device? If it is on the CPU and using too many threads, you might check whether setting the OMP_NUM_THREADS=1 environment variable (or some other small thread count) helps. If it is on another device (e.g., a GPU) but is the last step of your data loading/input construction, I would consider simply moving it “downstream”, e.g., into your model but wrapped in with torch.no_grad():, so that it is not affected by the num_workers parameter.
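If thread contention is the issue, one way to limit threads only inside the DataLoader workers (rather than setting OMP_NUM_THREADS for the whole program) is a worker_init_fn that calls torch.set_num_threads. A sketch, with a placeholder dataset:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

def limit_worker_threads(worker_id):
    # Each DataLoader worker is a separate process with its own intra-op
    # thread pool; cap it so num_workers processes don't oversubscribe
    # the CPU (the runtime analogue of OMP_NUM_THREADS=1 per worker).
    torch.set_num_threads(1)

dataset = TensorDataset(torch.randn(32, 4))  # placeholder dataset
loader = DataLoader(dataset, batch_size=8, num_workers=2,
                    worker_init_fn=limit_worker_threads)

batches = [batch for (batch,) in loader]
print(len(batches))  # 32 samples / batch_size 8 = 4 batches
```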

How can I tell what device extract_features is running on? I would like to leverage multiprocessing with num_workers > 1 to load data faster, so I don’t think I want to set OMP_NUM_THREADS=1. I tried moving extract_features inside forward() while also using with torch.no_grad() (if that’s what you meant by “moving it downstream”), but I still observe the same behavior as before.
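Concretely, the version I tried looks roughly like this (simplified; the real extract_features and model are replaced by placeholders):

```python
import torch
import torch.nn as nn

class ModelWithFeatures(nn.Module):
    def __init__(self):
        super().__init__()
        self.head = nn.Linear(4, 2)  # stand-in for the real model

    def forward(self, raw):
        # Feature construction moved "downstream" into forward(),
        # with gradients disabled for that step.
        with torch.no_grad():
            state_vector = raw * 2.0  # stand-in for extract_features(...)
        return self.head(state_vector)

model = ModelWithFeatures()
pred = model(torch.randn(3, 4))
print(tuple(pred.shape))  # (3, 2)
```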

OMP_NUM_THREADS doesn’t limit the number of worker processes (which is what num_workers controls); rather, it limits the amount of intra-process parallelism each worker uses, which can reduce thread contention when more workers are running.
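As for knowing which device extract_features runs on: in PyTorch an operation executes on the device of its input tensors, so printing the .device of its inputs or outputs tells you. A sketch with a stand-in tensor:

```python
import torch

state_vector = torch.zeros(4)  # stand-in for one output of extract_features
print(state_vector.device)     # cpu

# Unless extract_features explicitly moves its inputs with .to(device)
# or .cuda(), it runs on whatever device grid, curr_pos, etc. live on.
```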