Accelerating Dataloading with Prefetching

I understand this may be an old topic, but I really want to know about recent progress in PyTorch. I am currently training a model with Horovod (1 node, 4 GPUs), and my DataLoader uses 32 workers.

At this point, I am wondering whether I can further improve data-loading efficiency. One potential approach is to implement a prefetcher, but some posts online say that the PyTorch DataLoader already does prefetching automatically.

Therefore, I am wondering:

  • Is implementing a prefetcher still needed?
  • If so, are there any example implementations for multi-GPU settings?
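For context, here is roughly how I set up my loader (the dataset is a toy placeholder standing in for my real one). As I understand it, the DataLoader's built-in prefetching is controlled by the `prefetch_factor` argument, which sets how many batches each worker keeps ready in advance:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy dataset as a placeholder for my real training data.
dataset = TensorDataset(
    torch.randn(1024, 3, 32, 32),
    torch.randint(0, 10, (1024,)),
)

# Each worker prefetches `prefetch_factor` batches ahead of time,
# so with num_workers=32 and the default prefetch_factor=2,
# up to 64 batches can be queued by the loader itself.
loader = DataLoader(
    dataset,
    batch_size=64,
    num_workers=32,
    prefetch_factor=2,  # built-in per-worker prefetch depth
    pin_memory=True,    # pinned host memory speeds up host-to-GPU copies
)
```

So my question is really whether a custom prefetcher (e.g. one that overlaps host-to-device copies with compute) adds anything on top of this built-in behavior.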

Thank you so much!