I’m trying to train a complex model on a very large molecular dataset. Each row of the dataset is a tuple of (drug graph, protein graph, label, Demb, Aemb): the graphs are PyG Data objects and the embeddings are dictionaries. To accelerate training I tried to use DDP on one GPU (so I’m not using spawn), but when I do that I get a MemoryError / OOM kill / “too many open files” error.
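Concretely, one sample looks roughly like this (shapes, sizes and key names are only illustrative, not my real data):

```python
import torch
from torch_geometric.data import Data

# Illustrative sample only -- real features and shapes differ.
drug_graph = Data(x=torch.randn(30, 64), edge_index=torch.randint(0, 30, (2, 60)))
protein_graph = Data(x=torch.randn(500, 128), edge_index=torch.randint(0, 500, (2, 1500)))
label = torch.tensor([1.0])
demb = {"drug_id": torch.randn(256)}      # drug embedding dictionary
aemb = {"protein_id": torch.randn(1024)}  # protein embedding dictionary

sample = (drug_graph, protein_graph, label, demb, aemb)
```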
The code is organized as follows:

def main():
# dist.init_process_group()
# if rank == 0:
# load a metadata file containing the IDs
# split the data into train / test / validation
# broadcast_object_list of the train and validation sets
# ALL ranks execute:
# load the file that contains the graphs and the embedding information (more than 13 GB), retrieved using the identifiers, onto the CPU
# for each pair in the metadata, retrieve the information from the file to construct the tuple (drug graph, protein graph, label, demb, aemb)
# ***** sometimes the code reaches this point and fails *****
# initialise the model and move it to the GPU
# model = DDP(model, device_ids=[current_device], broadcast_buffers=False)
# initialise the optimiser + criterion
# custom Dataset class initialisation
# DistributedSampler
# DataLoader
# training
# validation
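In actual code, the structure is roughly this (condensed; `MyModel`, `PairDataset`, `load_and_split_metadata`, `load_graphs_and_embeddings` and `build_sample` are placeholders for my own code, and I launch with torchrun):

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DistributedSampler
from torch_geometric.loader import DataLoader  # collates PyG Data objects

def main():
    dist.init_process_group(backend="nccl")
    rank = dist.get_rank()
    current_device = int(os.environ["LOCAL_RANK"])  # set by torchrun
    torch.cuda.set_device(current_device)

    # Rank 0 reads the metadata and makes the split; the others receive it.
    if rank == 0:
        train_ids, val_ids = load_and_split_metadata()      # placeholder
    else:
        train_ids, val_ids = None, None
    payload = [train_ids, val_ids]
    dist.broadcast_object_list(payload, src=0)
    train_ids, val_ids = payload

    # Every rank loads the full >13 GB graph/embedding store into CPU RAM
    # and materialises all tuples -- this is where it sometimes dies.
    store = load_graphs_and_embeddings()                     # placeholder
    train_samples = [build_sample(store, pair) for pair in train_ids]

    model = MyModel().to(torch.device("cuda", current_device))  # placeholder
    model = DDP(model, device_ids=[current_device], broadcast_buffers=False)
    optimiser = torch.optim.Adam(model.parameters())
    criterion = torch.nn.BCEWithLogitsLoss()                 # placeholder criterion

    dataset = PairDataset(train_samples)                     # my custom Dataset
    sampler = DistributedSampler(dataset)
    loader = DataLoader(dataset, batch_size=32, sampler=sampler)
    # ... training and validation loops ...
```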
I would like to know if I’m using DDP in a wrong way that causes this problem. It seems that each rank duplicates the already large dataset, and that this is what causes the failure. Secondly, if the problem does come from this, what is the solution:
- Load the data only on rank 0 and then distribute it: how can we do that without storing extra copies in memory?
- If we use shared memory, could we run into access problems? (A sketch of what I mean is below.)
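For the shared-memory idea, this is roughly what I had in mind: keep only the lightweight ID pairs in each rank and memory-map the big store, so the pages are shared through the OS page cache instead of being copied per rank. This is only a sketch: the dictionary keys are made up, and it assumes the 13 GB file was written with torch.save and that PyTorch is recent enough (>= 2.1) to support torch.load(..., mmap=True).

```python
import torch
from torch.utils.data import Dataset

class LazyPairDataset(Dataset):
    """Holds only the (drug_id, protein_id, label) list per rank; the tensors
    inside the big store are memory-mapped from disk, so ranks share pages
    through the OS page cache instead of each holding a private 13 GB copy."""

    def __init__(self, pairs, store_path):
        self.pairs = pairs
        # mmap=True needs a file written by torch.save (zip format) and
        # PyTorch >= 2.1; weights_only=False because the store contains
        # PyG Data objects and plain dicts, not just tensors.
        self.store = torch.load(store_path, map_location="cpu",
                                mmap=True, weights_only=False)

    def __len__(self):
        return len(self.pairs)

    def __getitem__(self, idx):
        drug_id, prot_id, label = self.pairs[idx]
        drug_graph = self.store["drug_graphs"][drug_id]        # PyG Data
        protein_graph = self.store["protein_graphs"][prot_id]  # PyG Data
        demb = self.store["drug_embeddings"][drug_id]          # dict
        aemb = self.store["protein_embeddings"][prot_id]       # dict
        return drug_graph, protein_graph, torch.tensor(label), demb, aemb
```

Would this kind of lazy, memory-mapped Dataset be the right direction, or does it run into the access problems I’m worried about?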
I hope you can help me with this; I can provide any additional information if needed.