I understand that with PyTorch DDP, each process loads its own copy of the data from disk. However, my dataset is very large (a single large parquet file that loads into a dataframe), and I don’t have enough RAM for every process to hold its own copy. Is there a shared-memory implementation where one process loads the data into RAM and the other processes then read that same copy?
I also considered splitting the data across processes, but I can’t slice it (.iloc) until the whole file has been loaded.
Does your use case need the entire dataset to be in RAM before training starts? DDP, like PyTorch training in general, usually runs mini-batch gradient descent, so only one batch of data has to be in memory at a time. With PyTorch’s DataLoader, the memory requirement on each worker should therefore be fairly small. Could you try streaming the parquet file and feeding it through a DataLoader?
Yes, the data needs to be in RAM. That’s what makes my application complicated.
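If the whole dataset really has to stay resident, one possible approach is to load it once and expose it to the other processes through the standard library's multiprocessing.shared_memory. The following is only a rough sketch under some assumptions: the data is purely numeric, so it can be converted to a single NumPy array (for example with df.to_numpy()), and the helper names publish_array and attach_array are made up for illustration. String or object columns will not work this way.

```python
import numpy as np
from multiprocessing import shared_memory


def publish_array(arr, name):
    """Copy `arr` into a named shared-memory block.

    Hypothetical helper: call it once per machine, e.g. on local rank 0.
    Keep the returned handle alive; at shutdown call close() and unlink().
    """
    shm = shared_memory.SharedMemory(create=True, size=arr.nbytes, name=name)
    view = np.ndarray(arr.shape, dtype=arr.dtype, buffer=shm.buf)
    view[:] = arr  # the only full copy made after loading
    return shm


def attach_array(name, shape, dtype):
    """Map an existing block as a NumPy array without copying.

    Hypothetical helper: call it on the other ranks after the block exists.
    The caller must keep `shm` alive for as long as `view` is used.
    """
    shm = shared_memory.SharedMemory(name=name)
    view = np.ndarray(shape, dtype=dtype, buffer=shm.buf)
    return shm, view
```

The other ranks have to wait until the block exists (for example with torch.distributed.barrier()) and need to know the shape and dtype, which are small enough to pass around separately. While the block is being filled, the loading process briefly holds both the original array and the shared copy, so peak memory there is roughly twice the array size.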
How do I stream a parquet file?