I understand that with PyTorch DDP, each process loads its own instance of the data from disk. However, my dataset is very large (a single very large parquet file that loads into a dataframe), and with limited RAM I can’t have every process load its own copy into memory. Is there a shared memory implementation so that one process loads the data into RAM and the other processes then use that same loaded data?
I also thought of splitting the data, but I can’t split it (with .iloc) until after all of it has been loaded.
Does your use case need the entire dataset to be available in RAM before training starts? DDP and PyTorch training in general usually do batched gradient descent, which only needs one batch of data in memory at a time, so the memory requirement on each worker should be fairly small when using PyTorch’s dataloader. Could you try streaming the parquet file and using PyTorch’s dataloader to load the data?
Yes, the data needs to be in RAM. That’s what makes my application complicated.
How do I stream a parquet file?
I am also looking for a solution to this problem. If anybody knows about this, please help!
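For the shared-memory route asked about at the top of the thread, one option is Python's standard `multiprocessing.shared_memory` module. The sketch below assumes the dataframe can be converted to a single numeric NumPy array (for example via `df.to_numpy()`); the helper names `create_shared_array` and `attach_shared_array` are hypothetical. One process copies the data into a shared block, and the other processes attach to it by name without making their own copy.

```python
import numpy as np
from multiprocessing import shared_memory


def create_shared_array(source):
    """Run in ONE process: copy `source` into a new shared-memory block."""
    shm = shared_memory.SharedMemory(create=True, size=source.nbytes)
    view = np.ndarray(source.shape, dtype=source.dtype, buffer=shm.buf)
    view[:] = source  # the only full copy of the data
    return shm, view


def attach_shared_array(name, shape, dtype):
    """Run in every OTHER process: map the existing block, no copy made."""
    shm = shared_memory.SharedMemory(name=name)
    view = np.ndarray(shape, dtype=dtype, buffer=shm.buf)
    return shm, view
```

The creating process has to pass `shm.name`, the shape, and the dtype to the others (for instance through environment variables or a small file). Each process should keep its `shm` object alive while using the array, drop its array references and call `shm.close()` when done, and the creator should call `shm.unlink()` once at the end to free the block.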
Hi, I guess there are two things you can try:
- You can try the distributed sampler (torch.utils.data.distributed.DistributedSampler).
- If it does not work for you, you probably need to have your customized logic for the data loading. cc: @ejguan @VitalyFedyunin
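As a rough illustration of the first suggestion, the sketch below shows how `DistributedSampler` gives each rank a disjoint set of indices. The `make_loader` helper is hypothetical, and `rank` and `world_size` are passed explicitly here; in a real DDP job they would normally come from the initialized process group.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler


def make_loader(dataset, rank, world_size, batch_size=4):
    # Each rank iterates over its own subset of indices.
    sampler = DistributedSampler(
        dataset, num_replicas=world_size, rank=rank, shuffle=False
    )
    return DataLoader(dataset, batch_size=batch_size, sampler=sampler)


# Toy dataset standing in for the real data.
dataset = TensorDataset(torch.arange(10))
loader_rank0 = make_loader(dataset, rank=0, world_size=2)
loader_rank1 = make_loader(dataset, rank=1, world_size=2)
```

Note that the sampler only partitions indices. Every process still needs a dataset object it can index into, so on its own this does not reduce memory use unless the dataset reads lazily from disk or from shared memory. With `shuffle=True`, call `sampler.set_epoch(epoch)` at the start of each epoch so the ordering changes between epochs.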
torchdata does provide a DataPipe to stream a parquet file: data/dataframemaker.py at 2dfdcfb861971f08d5a0673ce17d6a979bc5de8d · pytorch/data · GitHub
But it requires some implementation changes to your data pipeline. We propose building the data pipeline from composable DataPipes. You can find more detailed documentation in the Tutorial page of the TorchData 0.5.0 (beta) documentation.