Is DataLoader num_workers the number of parallel __getitem__ calls?


I have a vision training DataLoader that takes a long time (T) and a lot of memory (M) to load at the beginning of each epoch. I notice that when we reduce num_workers by a factor of 4, both drop by the same factor: T’ = T/4 and M’ = M/4.

I’d like to better understand how num_workers and __getitem__ are connected: is num_workers the number of parallel processes that each call __getitem__?

What happens if num_workers > batch_size? Does the DataLoader call __getitem__ only as many times as needed? Or does it run num_workers __getitem__ calls and discard the extra records, or keep them for the next batch? And what happens if prefetch_factor is set to 1?

num_workers specifies the number of processes used to load and process the data.
Each process calls Dataset.__getitem__ repeatedly to create a full batch, and depending on prefetch_factor, additional batches are preloaded and put into the queue.
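
To make the prefetching arithmetic concrete, here is a rough sketch. It assumes the documented behavior that each worker loads prefetch_factor batches in advance (default 2). The helper names are made up for illustration and are not part of PyTorch.

```python
def max_batches_in_flight(num_workers: int, prefetch_factor: int = 2) -> int:
    """Upper bound on batches loaded ahead of the training loop."""
    if num_workers == 0:
        return 0  # data is loaded in the main process, nothing is prefetched
    return num_workers * prefetch_factor


def samples_in_flight(num_workers: int, batch_size: int, prefetch_factor: int = 2) -> int:
    """Same bound, expressed in samples (i.e. __getitem__ results held in memory)."""
    return max_batches_in_flight(num_workers, prefetch_factor) * batch_size


print(samples_in_flight(16, 32))  # 1024
print(samples_in_flight(4, 32))   # 256
```

This bound shrinks linearly with num_workers, which would be consistent with memory dropping by 4x when num_workers is divided by 4. It only counts prefetched batches; the per-worker copies of the Dataset also scale with num_workers.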

When creating the iterator, the underlying Dataset is copied to each worker and the processes start to create the batches. Depending on what Dataset.__init__ contains, these copies might be expensive, but since datasets usually load their data lazily, it shouldn’t be a huge slowdown.
In any case, you could also use persistent_workers=True to avoid recreating the workers in each epoch.
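
A minimal sketch of what that could look like. The TensorDataset here is just a stand-in for the real vision dataset, and make_loader is a hypothetical helper name.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset


def make_loader(num_workers=2):
    # Hypothetical stand-in dataset: 64 scalar samples.
    dataset = TensorDataset(torch.arange(64, dtype=torch.float32))
    return DataLoader(
        dataset,
        batch_size=8,
        num_workers=num_workers,  # must be > 0 for the two options below
        persistent_workers=True,  # keep worker processes alive between epochs
        prefetch_factor=2,        # batches loaded in advance per worker
    )


if __name__ == "__main__":
    loader = make_loader()
    for epoch in range(2):
        # Workers are started on the first epoch and reused on the second.
        seen = sum(batch[0].shape[0] for batch in loader)
        print(f"epoch {epoch}: {seen} samples")
```

With persistent_workers=True the Dataset is copied to the workers once, when the first iterator is created, instead of once per epoch. The memory held by the workers stays allocated between epochs, so this mainly helps with the startup time.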

Thanks @ptrblck, I’m still not sure I understand. Is it:

  1. A worker does a for loop calling __getitem__ b times to create a batch of size b on its own?
  2. Or a worker calls __getitem__ once, and the results of the N parallel __getitem__ calls made by N workers are concatenated into a batch of size b?

or something else?

Approach 1 is what is currently used. I know there was a feature request for 2, but I don’t know what its status is.
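
One way to check this yourself is to tag every sample with the id of the worker that loaded it, using torch.utils.data.get_worker_info(). This is only a sketch: WorkerIdDataset and worker_ids_per_batch are toy names made up for the experiment.

```python
from torch.utils.data import DataLoader, Dataset, get_worker_info


class WorkerIdDataset(Dataset):
    """Hypothetical toy dataset: each item records which worker loaded it."""

    def __len__(self):
        return 32

    def __getitem__(self, idx):
        info = get_worker_info()  # None when running in the main process
        worker_id = info.id if info is not None else -1
        return idx, worker_id


def worker_ids_per_batch(num_workers=4, batch_size=8):
    loader = DataLoader(WorkerIdDataset(), batch_size=batch_size, num_workers=num_workers)
    # The default collate function turns the (idx, worker_id) pairs
    # into two tensors, one of indices and one of worker ids.
    return [set(worker_ids.tolist()) for _, worker_ids in loader]


if __name__ == "__main__":
    for i, ids in enumerate(worker_ids_per_batch()):
        print(f"batch {i}: loaded by worker(s) {sorted(ids)}")
```

If approach 1 holds, every batch should report exactly one worker id. It also suggests that num_workers > batch_size does not lead to discarded samples: each worker builds whole batches, so extra workers just mean more batches being prepared at the same time.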
