Loading of duplicated data in distributed training

Hi, I would like to pretrain BERT using DDP.
I saved my pretraining dataset (a 350 GB corpus) as a torch.tensor.

When I run the command below, the dataset is loaded into memory 8 times.
python -m torch.distributed.launch --nproc_per_node=8 train.py
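Inside train.py the loading is essentially a plain torch.load of the whole tensor, roughly like this (a simplified sketch; the file name, batch size, and variable names are just placeholders):

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

# torch.distributed.launch starts 8 copies of train.py, and every copy
# executes this line, so the full 350 GB tensor ends up in memory 8 times.
data = torch.load("pretrain_corpus.pt")  # placeholder file name
dataset = TensorDataset(data)
loader = DataLoader(dataset, batch_size=32, shuffle=True)
```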

How can I prevent it?
Thanks.

Did you store the complete dataset in a single tensor?
If so, I think you might need to load it once and store smaller chunks of the data (and load only certain chunks in each process) or load the data lazily from the beginning.
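A minimal sketch of the chunking idea (file names and paths are placeholders; with torch.distributed.launch the rank can be read from the --local_rank argument it passes to the script):

```python
import argparse
import os
import torch

# One-time preprocessing: split the single large tensor into per-rank files.
def split_into_chunks(src_path, out_dir, num_chunks):
    data = torch.load(src_path)
    for i, chunk in enumerate(torch.chunk(data, num_chunks)):
        # clone() so each file stores only its own slice, not the full storage
        torch.save(chunk.clone(), os.path.join(out_dir, f"chunk_{i}.pt"))

# At training time, each process loads only the chunk matching its rank.
def load_chunk_for_rank(out_dir):
    parser = argparse.ArgumentParser()
    parser.add_argument("--local_rank", type=int, default=0)
    args, _ = parser.parse_known_args()
    return torch.load(os.path.join(out_dir, f"chunk_{args.local_rank}.pt"))
```

That way each of the 8 processes only keeps its own 1/8 of the corpus in memory (torch.chunk simply makes the last chunk smaller if the first dimension isn't evenly divisible).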


Yes, I did.
As you suggested, I stored the data as smaller chunks.

Thanks for your reply.

I thought it is expected to have a dedicated data loader in each process, so that 8 processes will have 8 dataloaders and 8 DDP instances?

cc @vincentqb please correct me if I am wrong.

Right, depending on how the code is organized, I would expect the 8 processes/GPUs to each get a different chunk of the data, as you said.
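For example, if the dataset loads samples lazily from disk, a DistributedSampler hands each rank a disjoint set of indices, so every process has its own dataloader and DDP instance but sees different data and never holds the full corpus in memory. A minimal sketch (the sample file names are hypothetical):

```python
import torch
import torch.distributed as dist
from torch.utils.data import DataLoader, Dataset
from torch.utils.data.distributed import DistributedSampler

class LazyCorpus(Dataset):
    """Reads one pre-saved sample file per index instead of the whole corpus."""
    def __init__(self, paths):
        self.paths = paths

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        # Only the requested sample is read from disk here.
        return torch.load(self.paths[idx])

dist.init_process_group(backend="nccl")  # env vars are set by torch.distributed.launch
dataset = LazyCorpus([f"sample_{i}.pt" for i in range(1_000_000)])  # hypothetical files
sampler = DistributedSampler(dataset)    # disjoint indices per rank
loader = DataLoader(dataset, batch_size=32, sampler=sampler)
```

(Call sampler.set_epoch(epoch) at the start of every epoch so the shuffling changes between epochs.)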

Hi, I am facing a similar issue to the one you describe.

Could you elaborate on how you went about storing smaller chunks of the data and loading them in each process, as @ptrblck mentions?