I have a fairly simple training script that
- Reads data from parquet into a pandas DF
- Pushes data into a torch tensor
- Uses TensorDataset/DistributedSampler/DataLoader to load data during training
- Uses DistributedDataParallel to manage distributed training across GPUs of a single instance.
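Roughly, each spawned worker does the following (simplified sketch; `args.data_path`, the column names, and `MyModel` stand in for my real code):

```python
import pandas as pd
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler


def train(gpu, args):
    # One process per GPU on a single machine, so rank == gpu.
    # MASTER_ADDR / MASTER_PORT are set in the environment.
    dist.init_process_group(backend="nccl", init_method="env://",
                            world_size=args.gpus, rank=gpu)
    torch.cuda.set_device(gpu)

    # This part runs in every spawned process: each one re-reads the
    # parquet file and builds its own full copy of the tensors.
    df = pd.read_parquet(args.data_path)
    features = torch.tensor(df[args.feature_cols].values, dtype=torch.float32)
    labels = torch.tensor(df[args.label_col].values, dtype=torch.float32)

    dataset = TensorDataset(features, labels)
    sampler = DistributedSampler(dataset, num_replicas=args.gpus, rank=gpu)
    loader = DataLoader(dataset, batch_size=args.batch_size,
                        shuffle=False, sampler=sampler)

    model = MyModel().cuda(gpu)
    model = DDP(model, device_ids=[gpu])
    # ... usual training loop over loader ...
```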
However, I know that when I call `mp.spawn(train, nprocs=args.gpus, args=(args,))`, the code that reads my feature and label data runs in every spawned process, so each worker presumably ends up holding its own full copy of the data. I'm sure this causes unnecessary memory and CPU overhead on the machine. Is there an obvious way to avoid this?
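For completeness, the entry point is just:

```python
import torch.multiprocessing as mp

if __name__ == "__main__":
    args = parse_args()  # argparse wrapper, omitted here
    # mp.spawn launches args.gpus fresh processes, each running train()
    # from the top -- including the parquet read and tensor construction.
    mp.spawn(train, nprocs=args.gpus, args=(args,))
```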
Thanks so much!
-Sohrab Andaz