I am currently training a model with DDP on 4 GPUs (32 vCPUs).
I have constructed my dataset to be a list of strings where each string represents a local path where the numerical data is stored.
When I run my code on a small dataset (e.g. 13 paths), torch.multiprocessing.spawn is extremely fast: the 4 processes are spawned almost immediately and training begins. However, when I run on the full dataset (2,000 paths), mp.spawn takes multiple hours to complete. I'm confused as to why this is the case. I don't understand why mp.spawn would depend on dataset size, since the dataset is not passed as an argument to the spawn call. Any insight into this problem would be greatly appreciated.
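For context, here is a minimal sketch of how my setup is structured (the path layout, function names, and sharding are simplified stand-ins for my actual code, not the real implementation):

```python
import torch.multiprocessing as mp

# The dataset is just a list of local path strings, built in the parent
# process. Note that it is NOT passed to mp.spawn below.
paths = [f"/data/shard_{i}.npy" for i in range(2000)]  # placeholder paths

def train(rank, world_size):
    # With the 'spawn' start method each worker re-imports this module,
    # so `paths` is rebuilt independently in every process.
    local_paths = paths[rank::world_size]
    print(f"rank {rank} handles {len(local_paths)} paths")

if __name__ == "__main__":
    world_size = 2  # 4 in my real run; reduced here for illustration
    mp.spawn(train, args=(world_size,), nprocs=world_size, join=True)
```

The slowdown happens in the `mp.spawn` call itself, before any training step runs.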