Hi, I am working on the following problem:
Assume the training set has a relatively large number of data samples, say 100,000, and we want to run DDP on this training set. Suppose we are running on 8 GPUs.
The common way to process the training set is to have a meta file that records, say, the path to each data sample, then create a dataset that only loads the meta file at initialization and lazily loads each data sample during training. In this case, you can create the same dataset on every GPU and use a DistributedSampler to partition it across ranks.
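For reference, this is roughly what I mean by the lazy-loading setup (just a sketch; LazySet is a placeholder name, and get_record is my own helper that reads one sample from disk):

import pandas as pd
from torch.utils.data import Dataset, DataLoader, DistributedSampler

class LazySet(Dataset):
    def __init__(self):
        # only the meta file is loaded up front
        self.meta = pd.read_csv("meta.csv")

    def __len__(self):
        return self.meta.shape[0]

    def __getitem__(self, idx):
        # each sample is read from disk on demand
        return get_record(self.meta.iloc[idx])

dataset = LazySet()                    # same full dataset on every rank
sampler = DistributedSampler(dataset)  # each rank only iterates over its own shard of indices
loader = DataLoader(dataset, batch_size=32, sampler=sampler)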
However, if we want to speed up training by loading all the data samples into RAM before training starts (as opposed to lazy loading), then I think we end up duplicating the whole dataset on every GPU when the dataset is created. In my experiment the preloaded data takes 20+ GB of RAM, so loading it 8 times consumes more than 160 GB, which is crazy.
My question is whether it is valid practice to decide, before any data loading happens, how to split the training data into 8 folds (say, computed on rank 0), and then explicitly initialize the dataset on every GPU with only its assigned fold. That way the preloaded dataset on each rank only takes roughly 1/8 of the full data size, so the total RAM I need to request for my job drops from 160+ GB to 20+ GB.
I haven't tried this approach before, and I am not sure whether it breaks any rules DDP expects us to follow. I'd really appreciate any thoughts or suggestions, thanks!
This is what the dataset class looks like:
import os

import pandas as pd
from torch.utils.data import Dataset

class ToySet(Dataset):
    def __init__(self, args, logger):
        rank = args.local_rank
        num_gpus = int(os.environ["LOCAL_WORLD_SIZE"])
        meta = pd.read_csv("meta.csv")
        total_size = meta.shape[0]
        block_size = total_size // num_gpus
        self.logger = logger
        # keep only this rank's fold; the last rank also takes the remainder rows
        self.meta = meta.iloc[rank * block_size: (rank + 1) * block_size] if rank < num_gpus - 1 else meta.iloc[rank * block_size:]
        self.logger.info(f"Dataset: local rank {rank} preloading")
        self.datalist = self.preload_data()

    def preload_data(self):
        local_list = []
        for idx, row in self.meta.iterrows():
            local_list.append(get_record(row))  # load each sample into RAM
        return local_list

    def __len__(self):
        return len(self.datalist)

    def __getitem__(self, idx):
        return self.datalist[idx]
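Then, since each rank already holds only its own fold, my understanding is that the DataLoader would be created without a DistributedSampler, something like this (this is my assumption of how it should be wired up, not something I've verified; the batch size is arbitrary):

from torch.utils.data import DataLoader

train_set = ToySet(args, logger)  # each rank builds only its own 1/8 fold
train_loader = DataLoader(train_set, batch_size=32, shuffle=True)  # no DistributedSampler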