How to stream a folder of CSV data for multi-GPU training using DistributedSampler?

Hi :wave: I have a folder with many CSV files that cannot all be loaded into memory. How do I feed this data into multi-GPU training?

What I have done

  1. At the moment, I’m loading this data into my training application using TorchData datapipes (Example tutorial).
  2. As my full production dataset is too big to fit into memory, and I’d like to accelerate my training, I want to train using multiple GPUs, with each GPU reading only a subset of the data.
  3. To do that, I’m trying to use DistributedSampler alongside the DistributedDataParallel strategy.
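For context, my current loading logic looks roughly like this. This is a simplified sketch using only the stdlib `csv` module; my real code builds an equivalent pipeline with torchdata datapipes, and the folder path and helper name here are just illustrative:

```python
import csv
import glob
import os

def stream_csv_rows(folder):
    """Lazily yield rows from every CSV file in a folder,
    holding only one row in memory at a time."""
    for path in sorted(glob.glob(os.path.join(folder, "*.csv"))):
        with open(path, newline="") as f:
            for row in csv.reader(f):
                yield row
```

This streams fine on a single process, but it has no `len()`, which is where my multi-GPU problem starts.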

However, I’m getting the error "MapperIterDataPipe has no len", which of course it doesn’t, because the full data size is not known in advance. What’s the best strategy, then, to stream CSV records into distributed GPU training in PyTorch?
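One workaround I’ve considered is to skip DistributedSampler entirely and shard the *file list* deterministically by rank, since a stream of unknown length can still be split at the file level. A minimal sketch of the idea (in real code, `rank` and `world_size` would come from `torch.distributed`):

```python
def shard_files(files, rank, world_size):
    """Give each rank a disjoint, round-robin slice of the file list.
    The union of all ranks' shards is exactly the original list,
    so no file is read twice and none is skipped."""
    return files[rank::world_size]
```

Is this the recommended pattern, or is there something more built-in?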

In TensorFlow, there’s a DistributedDataset and MirroredStrategy, which together take care of everything. What’s the best practice for PyTorch?

I also attempted to use PyTorch Lightning for multi-GPU training, and then discovered that PyTorch Lightning doesn’t automatically distribute data across multiple GPU devices for IterableDatasets. It only does so automatically for map-style datasets.
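From what I understand, an IterableDataset has to drop data itself, splitting across both ranks and DataLoader workers. Here is my understanding of the indexing logic as a pure-Python sketch (the function name is mine; in real code `rank`/`world_size` would come from `torch.distributed` and `worker_id`/`num_workers` from `torch.utils.data.get_worker_info()`):

```python
def rows_for_consumer(row_indices, rank, world_size, worker_id, num_workers):
    """With world_size ranks and num_workers DataLoader workers per rank,
    there are world_size * num_workers consumers in total; consumer k
    keeps every k-th element, so the shards are disjoint and complete."""
    total = world_size * num_workers
    offset = rank * num_workers + worker_id
    return row_indices[offset::total]
```

Is hand-rolling this really what’s expected for IterableDatasets, or am I missing a utility that does it for me?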

Wondering what the best practice is for production datasets (very large data, sharded across many CSV files) plus multi-GPU training in PyTorch!