I understand this may be an old topic, but I would like to know about recent progress in PyTorch. I am currently training a model with Horovod (1 node, 4 GPUs), and my DataLoader uses 32 workers.
I am wondering whether I can further improve data-loading efficiency. One option would be to implement a prefetcher, but some posts online say that the PyTorch DataLoader already prefetches automatically.
Therefore, I am wondering:
- Is implementing a custom prefetcher still needed?
- If so, are there any example implementations for multi-GPU settings?
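For reference, here is a minimal sketch of my current loading setup. The dataset is just a stand-in, and I have scaled the numbers down (the real job uses `num_workers=32`); the `DistributedSampler` line shows how I shard data across the Horovod ranks:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in dataset; the real one is much larger.
dataset = TensorDataset(torch.randn(64, 3), torch.randint(0, 2, (64,)))

# With Horovod, each rank gets a disjoint shard, e.g.:
# from torch.utils.data.distributed import DistributedSampler
# sampler = DistributedSampler(dataset, num_replicas=hvd.size(), rank=hvd.rank())

loader = DataLoader(
    dataset,
    batch_size=8,
    num_workers=2,       # 32 in the real job
    prefetch_factor=4,   # each worker keeps 4 batches queued (default is 2)
    pin_memory=True,     # enables faster, async host-to-GPU copies
)

batch_x, batch_y = next(iter(loader))
print(batch_x.shape)  # torch.Size([8, 3])
```

My understanding is that `prefetch_factor` controls how many batches each worker loads ahead of time, which is the built-in prefetching those posts refer to, but I am not sure whether that makes a hand-written GPU-side prefetcher (e.g. one that overlaps host-to-device copies with compute on a separate CUDA stream) redundant.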
Thank you so much!