Using `num_workers > 0` for CPU-only devices?

Hi,

When I train my model on a CPU-only device, does it make sense to set num_workers > 0? From the documentation it’s not entirely clear to me, as it only states:

[…] how many subprocesses to use for data loading. 0 means that the data will be loaded in the main process. (default: 0)

I’ve also taken a brief look at the source code, but again, it is not obvious to me whether num_workers > 0 makes sense on a CPU-only device.

(For pin_memory, on the other hand, it’s pretty clear that it only makes sense when training on a GPU, since the documentation states:

If True, the data loader will copy Tensors into CUDA pinned memory before returning them.

)

Yes, it can make sense to use multiple workers, as they load the next batch(es) in the background while the model is being trained on the current one. Depending on which operations are used to train the model and whether they already saturate all CPU cores, you might or might not see a speedup, so you should benchmark different configurations for your actual workload.
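
A minimal sketch of such a benchmark, assuming a toy dataset and a tiny linear model as placeholders (swap in your own Dataset and training step):

```python
import time

import torch
from torch.utils.data import DataLoader, TensorDataset


def benchmark(num_workers_values=(0, 2, 4), batch_size=64):
    # Placeholder data and model, just to have something to iterate and train.
    dataset = TensorDataset(torch.randn(10_000, 32), torch.randint(0, 2, (10_000,)))
    model = torch.nn.Linear(32, 2)
    criterion = torch.nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

    for num_workers in num_workers_values:
        loader = DataLoader(dataset, batch_size=batch_size, num_workers=num_workers)
        start = time.perf_counter()
        for data, target in loader:
            optimizer.zero_grad()
            loss = criterion(model(data), target)
            loss.backward()
            optimizer.step()
        elapsed = time.perf_counter() - start
        print(f"num_workers={num_workers}: {elapsed:.2f}s per epoch")


if __name__ == "__main__":
    # The main guard is required for num_workers > 0 on platforms that
    # spawn worker processes (e.g. Windows, macOS).
    benchmark()
```

With a trivial in-memory dataset like this the workers may not help at all; the benefit usually shows up when loading/decoding each sample is expensive (e.g. reading and transforming images from disk) relative to the training step.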