Memory error with DDP on one GPU: data is distributed on the CPU before moving it

I’m trying to train a complex model on a very large molecular dataset. Each row of this dataset is a tuple (drug graph, protein graph, label, Demb, Aemb), where the graphs are PyG Data objects and the embeddings are dictionaries. To accelerate training I tried to use DDP on one GPU (so I’m not using spawn), but when I do that I get a memory error / OOM kill / "too many open files" error. The code is organized as follows:

def main():
    # dist.init_process_group()
    # if rank == 0:
    #     load a metadata file containing the ids
    #     split the data into train / test / validation
    #     broadcast_object_list of the train and validation sets
    # ALL ranks execute:
    #     load the file that contains the graphs and the embedding information
    #     (more than 13 GB) into CPU memory, retrieved using the identifiers
    #     for each pair in the metadata, retrieve the information from the file
    #     to construct the tuple (drug graph, protein graph, label, demb, aemb)
    # ******* sometimes the code reaches this point and fails *******
    # initialise the model and move it to the GPU
    # model = DDP(model, device_ids=[current_device], broadcast_buffers=False)
    # initialise the optimiser + criterion
    # custom Dataset class initialisation
    # DistributedSampler
    # DataLoader
    # training
    # validation
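
In concrete terms, a simplified version of that flow looks roughly like the sketch below (helper names such as load_metadata, split_ids, build_tuples, build_model, and PairDataset are placeholders, not the real code):

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler

def main():
    dist.init_process_group(backend="nccl")
    rank = dist.get_rank()
    current_device = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(current_device)

    # Only rank 0 reads the metadata and builds the splits,
    # then the id lists are broadcast to every rank.
    splits = [None, None]
    if rank == 0:
        train_ids_0, val_ids_0 = split_ids(load_metadata("metadata.csv"))  # placeholders
        splits = [train_ids_0, val_ids_0]
    dist.broadcast_object_list(splits, src=0)
    train_ids, val_ids = splits

    # Every rank loads the full >13 GB store into CPU RAM here, so host
    # memory usage grows with the number of processes.
    store = torch.load("graphs_and_embeddings.pt")       # placeholder path
    train_tuples = build_tuples(train_ids, store)        # placeholder

    model = build_model().to(current_device)             # placeholder
    model = DDP(model, device_ids=[current_device], broadcast_buffers=False)

    dataset = PairDataset(train_tuples)                  # custom Dataset class
    sampler = DistributedSampler(dataset)
    loader = DataLoader(dataset, batch_size=32, sampler=sampler)
    # ... optimiser, criterion, training and validation loops ...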

I would like to know whether I am using DDP in a wrong way that causes this problem; it seems that each rank duplicates the dataset, which is already large, and that this duplication causes the failure. Secondly, if the problem does come from this point, what is the solution:

  1. Load the data only on rank 0 and then distribute it: how can we do that without storing extra copies in memory?
  2. Use shared memory: but could we then run into problems with concurrent access?

I hope you can help me with this; I can provide any additional information.
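
(Related to the "too many open files" part specifically: would switching the multiprocessing sharing strategy be relevant here? This is just an assumption on my side, for reference:)

import torch.multiprocessing as mp

# On Linux the default "file_descriptor" strategy can exhaust the open-file
# limit when many tensors are shared between DataLoader workers; the
# "file_system" strategy passes file names instead of keeping descriptors open.
mp.set_sharing_strategy("file_system")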

Would it be possible to lazily load the data on each rank (which is the common approach when using DDP)?

In this case, is there any common solution to this problem in PyTorch?

I’m not sure I understand your question. I asked if you have considered lazily loading the data (which is possible in PyTorch) and whether this would be a feasible approach.
If you are looking for an example, check the Dataset tutorial.
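
As a rough illustration, a minimal sketch of a lazily loading Dataset, assuming each (drug graph, protein graph, label, Demb, Aemb) tuple has been written to its own .pt file in a one-time preprocessing step (the file layout and the LazyPairDataset name are assumptions, not code from the tutorial):

import os
import torch
from torch.utils.data import Dataset

class LazyPairDataset(Dataset):
    def __init__(self, root, ids):
        self.root = root    # directory holding one file per sample
        self.ids = ids      # sample identifiers from the metadata split

    def __len__(self):
        return len(self.ids)

    def __getitem__(self, idx):
        # Nothing is kept in RAM up front; each sample is read on demand,
        # so a rank/worker only pays for the samples it actually touches.
        path = os.path.join(self.root, f"{self.ids[idx]}.pt")
        drug_graph, protein_graph, label, demb, aemb = torch.load(path)
        return drug_graph, protein_graph, label, demb, aemb

The one-time preprocessing that writes those per-sample files could be run only on rank 0, with a dist.barrier() before the other ranks start reading, so no process ever has to hold the full 13 GB store in memory.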