Hi, I have a folder with many CSV files that cannot all be loaded into memory at once. How do I feed this data into multi-GPU training?
What I have done
- At the moment, I’m loading this data into my training application using TorchData datapipes (Example tutorial).
- My full production dataset is too large to fit into memory, and I’d like to accelerate training by using multiple GPUs, with each GPU reading only a subset of the data.
- To do that, I’m trying to use DistributedSampler together with the DistributedDataParallel strategy.
However, I’m getting the error "MapperIterDataPipe has no len", which of course it doesn’t, because the full data size is not known in advance. What’s the best strategy for streaming CSV records into distributed GPU training in PyTorch?
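For context, the sharding behavior I’m after can be sketched without any framework: assign the CSV files round-robin to the distributed ranks, so each GPU process streams only its own subset and never materializes the full dataset. (This is just an illustrative sketch; the folder path and the `rank`/`world_size` values are placeholders, and in a real job they would come from the distributed launcher.)

```python
import csv
import glob
import os

def shard_csv_rows(folder, rank, world_size):
    """Yield rows only from the CSV files assigned to this rank.

    Files are assigned round-robin by sorted index, so each of the
    world_size processes streams a disjoint subset of files without
    ever loading the whole dataset into memory.
    """
    files = sorted(glob.glob(os.path.join(folder, "*.csv")))
    for i, path in enumerate(files):
        if i % world_size != rank:
            continue  # this file belongs to another rank
        with open(path, newline="") as f:
            reader = csv.reader(f)
            next(reader, None)  # skip the header row
            for row in reader:
                yield row
```

Each rank would then wrap its own generator in an IterableDataset (or an equivalent datapipe) and feed it to its local DataLoader, so no DistributedSampler and no `len()` are needed.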
In TensorFlow, there’s tf.distribute.DistributedDataset and MirroredStrategy, which take care of everything. What’s the best practice in PyTorch?