DDP without DistributedSampler to avoid loading the dataset multiple times

Hi, I am working on the following problem:

Assume the training set has a relatively large number of data samples, say 100,000, and we want to run DDP on it across 8 GPUs.

The common way to process such a training set is to keep a meta file that records, say, the path to each data sample. The dataset then loads only the meta file at initialization and lazily loads each data sample at training time. In that case you can create the dataset on every GPU and use a DistributedSampler to partition it.
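For context, the standard lazy-loading setup would look roughly like the sketch below (LazySet is a made-up name, and get_record stands in for whatever function reads one sample from disk, as in my code further down):

import pandas as pd
from torch.utils.data import DataLoader, Dataset
from torch.utils.data.distributed import DistributedSampler


class LazySet(Dataset):
    def __init__(self, meta_path):
        # Only the small meta file is loaded here; samples stay on disk.
        self.meta = pd.read_csv(meta_path)

    def __len__(self):
        return len(self.meta)

    def __getitem__(self, idx):
        # Lazily read one sample from disk at training time.
        return get_record(self.meta.iloc[idx])


dataset = LazySet("meta.csv")
# DistributedSampler picks up rank / world size from the initialized process group.
sampler = DistributedSampler(dataset, shuffle=True)
loader = DataLoader(dataset, batch_size=32, sampler=sampler)

for epoch in range(10):
    sampler.set_epoch(epoch)  # reshuffle the per-rank partition every epoch
    for batch in loader:
        ...  # training step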

However, if we want to speed up training by loading all the data samples into RAM before training starts (as opposed to lazy loading), then creating the dataset this way duplicates it on every GPU. In my experiment the preloaded data takes 20+ GB of RAM, so loading it 8 times consumes more than 160 GB, which is crazy.

My question is whether it is valid practice to decide, before any data loading happens, how the training data is split into 8 folds (computed on rank 0), and then explicitly initialize the dataset on every GPU with its assigned fold. That way the preloaded dataset on each rank only takes roughly 1/8 of the full data, so the total RAM I need to request for my job drops from 160+ GB to 20+ GB.

I haven't tried this approach before, and I am not sure whether it would break any DDP rules we need to follow. I really appreciate any thoughts or suggestions, thanks!

This is what the dataset class looks like:

import os

import pandas as pd
from torch.utils.data import Dataset


class ToySet(Dataset):
    def __init__(self, args, logger):
        rank = args.local_rank
        num_gpus = int(os.environ["LOCAL_WORLD_SIZE"])
        meta = pd.read_csv("meta.csv")
        total_size = meta.shape[0]
        block_size = total_size // num_gpus

        self.logger = logger

        # Each rank keeps only its own contiguous block of rows;
        # the last rank also takes the remainder.
        if rank < num_gpus - 1:
            self.meta = meta.iloc[rank * block_size: (rank + 1) * block_size]
        else:
            self.meta = meta.iloc[rank * block_size:]

        self.logger.info(f"Dataset: local rank {rank} preloading")
        self.datalist = self.preload_data()

    def preload_data(self):
        # Eagerly load every sample of this rank's shard into RAM.
        local_list = []
        for idx, row in self.meta.iterrows():
            local_list.append(get_record(row))  # load data to RAM
        return local_list

    # Standard Dataset protocol over the preloaded shard.
    def __len__(self):
        return len(self.datalist)

    def __getitem__(self, idx):
        return self.datalist[idx]
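
On each rank the intended usage would then be a plain shuffled DataLoader over that rank's shard, with no DistributedSampler (args and logger as above). One thing I would keep in mind is that DDP synchronizes gradients every step, so every rank should run the same number of steps per epoch; keeping the shards the same length (e.g. truncating to block_size everywhere) and/or setting drop_last helps with that. A rough sketch:

from torch.utils.data import DataLoader

train_set = ToySet(args, logger)      # holds only this rank's shard
train_loader = DataLoader(
    train_set,
    batch_size=32,
    shuffle=True,     # permutes the local shard; no DistributedSampler needed
    drop_last=True,   # helps keep the per-rank step count identical
    num_workers=4,
    pin_memory=True,
)

for epoch in range(10):
    for batch in train_loader:
        ...  # forward / backward on the DDP-wrapped model as usual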

Hello, the concern you mentioned should not be an issue. First, with the usual lazy-loading setup, DDP only duplicates the loading of the metadata, which has minimal impact. Second, because of the operating system's page cache, you don't actually need to load the dataset into memory explicitly: as long as the machine has enough memory to hold all the data, every access after the first is served from memory, with no additional performance loss.

Hi, thanks for your reply!

If I understand correctly, you are saying that lazy loading will only slow down the first epoch, and for the remaining epochs the data should already be cached in memory for fast reads, so there is no need to manually load the data into RAM before training.

I am not sure about the RAM behavior of the online cluster I am using, but I do set pin_memory=True in the data loader. I will try lazy loading and see if it works efficiently.

Pinned memory and the page cache are two different concepts. Using pinned memory generally speeds up the dataloader, but the two work on different underlying principles.

Pinned memory is memory that is locked in physical RAM so the operating system cannot swap it out to disk. It is commonly used in GPU programming to improve data transfer efficiency.
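A tiny illustration of the pinned-memory side (this assumes a CUDA device is available; DataLoader(pin_memory=True) does the same pinning automatically for each batch it yields):

import torch

device = torch.device("cuda")

x = torch.randn(1024, 1024)   # ordinary pageable host memory
x_pinned = x.pin_memory()     # copy into page-locked (pinned) host memory

# With a pinned source, the host-to-device copy can run asynchronously
# with respect to the host, which is why pinned batches transfer faster.
x_gpu = x_pinned.to(device, non_blocking=True)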

On the other hand, page cache is a part of memory used by the operating system to cache file data. It is used to improve the performance of file reading and writing operations.

While both pinned memory and page cache can contribute to improving data transfer and file access performance, they operate at different levels and have different underlying mechanisms.