PyTorch DDP but with only one process to load data

amirhf · July 26, 2021, 7:52pm

Hello,

I understand that with PyTorch DDP, each process loads its own copy of the data from disk. However, my dataset is very large (a single large parquet file that loads into a dataframe), and with limited RAM I can’t have every process load it into memory. Is there a shared-memory implementation in which one process loads the data into RAM and the other processes then use that same loaded copy?

I also thought of splitting the data, but I can’t split it (with .iloc) until all of it has been loaded.

wanchaol · July 27, 2021, 1:58am

Does your use case need the entire dataset to be in RAM before training starts? DDP, like PyTorch training in general, usually does mini-batch gradient descent, so only one batch of data has to be in memory at a time, and the memory requirement on each worker should be fairly small when using PyTorch’s DataLoader. Could you try streaming the parquet file and loading it through PyTorch’s DataLoader?

amirhf · July 27, 2021, 8:30pm

Yes, the data needs to be in RAM. That’s what makes my application complicated.
How do I stream a parquet file?

f10w · November 12, 2022, 3:50pm

I am also looking for a solution to this problem. If anybody knows about this, please help!

fduwjj · November 15, 2022, 8:07pm

Hi, I guess there are two things you can try:

1. Try the distributed sampler.
2. If that does not work for you, you will probably need your own customized data-loading logic.

cc: @ejguan @VitalyFedyunin
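One possible shape for such customized logic, as a rough sketch rather than a DDP recipe: a single loader process copies the samples into a named shared-memory block using the standard library's `multiprocessing.shared_memory`, and every rank attaches to the block by name and reads only its own strided share of indices (rank, rank + world_size, ...), similar in spirit to how a distributed sampler splits indices. The helper names `publish`, `read_shard`, and `demo` are made up for illustration, and the ranks are simulated inside one process here.

```python
import struct
from multiprocessing import shared_memory

ITEM = struct.Struct("d")  # illustration only: one float64 per sample


def publish(values):
    """Loader process: copy the samples into a new shared-memory block."""
    shm = shared_memory.SharedMemory(create=True, size=ITEM.size * len(values))
    struct.pack_into(f"{len(values)}d", shm.buf, 0, *values)
    return shm  # keep this handle alive; shm.name identifies the block


def read_shard(name, num_samples, rank, world_size):
    """Any rank: attach by name and read only this rank's strided share."""
    shm = shared_memory.SharedMemory(name=name)
    try:
        return [
            ITEM.unpack_from(shm.buf, i * ITEM.size)[0]
            for i in range(rank, num_samples, world_size)
        ]
    finally:
        shm.close()  # detach only; the block lives until unlink()


def demo(world_size=2):
    """Simulate the ranks in one process; real ranks would be separate processes."""
    samples = [float(i) for i in range(6)]
    owner = publish(samples)
    try:
        return [
            read_shard(owner.name, len(samples), rank, world_size)
            for rank in range(world_size)
        ]
    finally:
        owner.close()
        owner.unlink()  # free the block once every reader is done
```

Calling `demo()` returns `[[0.0, 2.0, 4.0], [1.0, 3.0, 5.0]]`. In a real job the block name and sample count would have to be communicated to the other ranks somehow, and for a large numeric table a NumPy array created over `shm.buf` would be a more practical view than per-item `struct` reads.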

ejguan · November 15, 2022, 9:14pm

I think torchdata does provide a DataPipe to stream parquet files: see data/dataframemaker.py at commit 2dfdcfb861971f08d5a0673ce17d6a979bc5de8d in the pytorch/data repository on GitHub.

However, it requires some implementation changes to your data pipeline. Our proposal is to build data pipelines out of composable DataPipes. You can find more detailed documentation in the Tutorial section of the TorchData 0.5.0 (beta) documentation.

cc: @nivek