Loading of duplicated data in distributed training

Hi, I would like to pretrain BERT using DDP.
I saved my pretraining dataset (a 350 GB corpus) as a torch.tensor.

When I run the command below, the dataset is loaded into memory 8 times.
python -m torch.distributed.launch --nproc_per_node=8 train.py
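Inside train.py the loading is essentially a plain torch.load of the whole tensor, roughly like this (a simplified sketch; the file name, batch size, and variable names are just placeholders):

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

# torch.distributed.launch starts 8 copies of train.py, and every copy
# executes this line, so the full 350 GB tensor ends up in memory 8 times.
data = torch.load("pretrain_corpus.pt")  # placeholder file name
dataset = TensorDataset(data)
loader = DataLoader(dataset, batch_size=32, shuffle=True)
```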

How can I prevent it?
Thanks.

Did you store the complete dataset in a single tensor?
If so, I think you might need to load it once and store smaller chunks of the data (and load only certain chunks in each process) or load the data lazily from the beginning.
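A minimal sketch of the chunking idea (file names and paths are placeholders; with torch.distributed.launch the rank can be read from the --local_rank argument it passes to the script):

```python
import argparse
import os
import torch

# One-time preprocessing: split the single large tensor into per-rank files.
def split_into_chunks(src_path, out_dir, num_chunks):
    data = torch.load(src_path)
    for i, chunk in enumerate(torch.chunk(data, num_chunks)):
        # clone() so each file stores only its own slice, not the full storage
        torch.save(chunk.clone(), os.path.join(out_dir, f"chunk_{i}.pt"))

# At training time, each process loads only the chunk matching its rank.
def load_chunk_for_rank(out_dir):
    parser = argparse.ArgumentParser()
    parser.add_argument("--local_rank", type=int, default=0)
    args, _ = parser.parse_known_args()
    return torch.load(os.path.join(out_dir, f"chunk_{args.local_rank}.pt"))
```

That way each of the 8 processes only keeps its own 1/8 of the corpus in memory (torch.chunk simply makes the last chunk smaller if the first dimension isn't evenly divisible).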


Yes, I did.
As you suggested, I stored the data as smaller chunks.

Thanks for your reply.

I thought it is expected to have a dedicated data loader in each process, so that 8 processes will have 8 dataloaders and 8 DDP instances?

cc @vincentqb please correct me if I am wrong.

Right, depending on how the code is organized, I would expect the 8 processes/GPUs to each get a different chunk of the data, as you said.
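For example, if the dataset loads samples lazily from disk, a DistributedSampler hands each rank a disjoint set of indices, so every process has its own dataloader and DDP instance but sees different data and never holds the full corpus in memory. A minimal sketch (the sample file names are hypothetical):

```python
import torch
import torch.distributed as dist
from torch.utils.data import DataLoader, Dataset
from torch.utils.data.distributed import DistributedSampler

class LazyCorpus(Dataset):
    """Reads one pre-saved sample file per index instead of the whole corpus."""
    def __init__(self, paths):
        self.paths = paths

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        # Only the requested sample is read from disk here.
        return torch.load(self.paths[idx])

dist.init_process_group(backend="nccl")  # env vars are set by torch.distributed.launch
dataset = LazyCorpus([f"sample_{i}.pt" for i in range(1_000_000)])  # hypothetical files
sampler = DistributedSampler(dataset)    # disjoint indices per rank
loader = DataLoader(dataset, batch_size=32, sampler=sampler)
```

(Call sampler.set_epoch(epoch) at the start of every epoch so the shuffling changes between epochs.)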

Hi, I am facing a similar issue to the one you describe.

Could you elaborate on how you went about storing smaller chunks of the data and loading them in each process, as @ptrblck mentions?