Does DDP send the full dataset to each GPU?

Hi,
I think I have a really simple question for most of you. I am trying to create a Masked Language Modeling script with DDP, which means I am tokenizing and masking a large portion of text.
The problem is that I do not understand one aspect of DDP. I thought that only one process stores the full dataset and then distributes mini-batches to each GPU.

For 3000 examples, when I compute:
len(train_dataloader) * num_gpus * batch_size
I get 2100 (70% of the full dataset is training data). This part I understand: when I spawn a process on each GPU, each dataloader yields mini-batches of its own share of the data.

However, when I execute:
len(train_dataloader.dataset) + len(eval_dataloader.dataset)
I get 3000, which is the size of my full dataset.

How exactly does this work? Does each GPU receive the full dataset, and the DistributedSampler then splits the indices in the DataLoader (so that each DataLoader yields only its specific portion of the data), while DataLoader.dataset still refers to the full data no matter what? Can someone explain this aspect more clearly to me? Does it mean that DDP is not that memory-efficient and only enables faster training thanks to gradient averaging across multiple GPUs?

DDP uses a single process per device in the recommended setup, which means that each process is responsible for loading and processing its own data.
You would usually use a DistributedSampler, which splits the indices based on the number of ranks so that the ranks do not train on the same samples.
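As an illustration (not from the original reply), here is a minimal sketch of how the sampler splits indices per rank while DataLoader.dataset still refers to the whole dataset; the 3000-element dataset, rank count, and batch size are made up to match the numbers discussed above:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

# stand-in for the tokenized corpus (3000 samples)
dataset = TensorDataset(torch.arange(3000))

# passing num_replicas/rank explicitly so this snippet runs without init_process_group;
# inside a real DDP worker you would omit them and let the process group provide them
sampler = DistributedSampler(dataset, num_replicas=3, rank=0, shuffle=True)
loader = DataLoader(dataset, batch_size=10, sampler=sampler)

print(len(loader.dataset))  # 3000 -> every rank still references the full dataset object
print(len(loader))          # 100  -> but each rank only iterates its own 1000-sample shard
```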

Thank you for your answer. I am using a DistributedSampler, and that’s why each DataLoader yields its own portion of the data (even though it stores the full dataset when I check the dataset attribute).
However, I am using the spawn method and launching training on a couple of processes. I also have functions that load the dataset inside the training script. Does that mean the full dataset is allocated for every GPU?

If you are eagerly preloading the data, each process would also hold it, yes. The common approach is thus to load the data lazily. The data won’t be automatically moved to the GPU and will reside where it was originally loaded, so on the CPU in the common approach.
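For what it’s worth, a rough sketch of the spawn-based setup (the dataset class, port, batch size, and epoch count are placeholders, not from this thread): the point is that the dataset and dataloader are built inside each spawned worker, so the parent process never holds an eagerly loaded copy that gets duplicated.

```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.utils.data import Dataset, DataLoader
from torch.utils.data.distributed import DistributedSampler


class PlaceholderDataset(Dataset):
    """Stands in for the real dataset; __init__ keeps only lightweight state."""

    def __init__(self, size=3000):
        self.size = size

    def __len__(self):
        return self.size

    def __getitem__(self, idx):
        # a real implementation would read/tokenize the sample here on demand
        return torch.tensor([idx], dtype=torch.float32)


def train(rank, world_size):
    os.environ.setdefault("MASTER_ADDR", "localhost")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("nccl", rank=rank, world_size=world_size)

    # dataset/loader are created inside each spawned process
    dataset = PlaceholderDataset()
    sampler = DistributedSampler(dataset)  # picks up rank/world_size from the process group
    loader = DataLoader(dataset, batch_size=16, sampler=sampler)

    for epoch in range(2):
        sampler.set_epoch(epoch)  # reshuffle the per-rank shards every epoch
        for batch in loader:
            pass  # forward/backward with the DDP-wrapped model would go here

    dist.destroy_process_group()


if __name__ == "__main__":
    world_size = torch.cuda.device_count()
    mp.spawn(train, args=(world_size,), nprocs=world_size)
```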

Thank you. Could you explain what lazy loading exactly means? I cannot find precise information about its use in the deep learning field.

This tutorial explains how to write a custom Dataset which lazily loads the data.
Basically, you are not preloading all samples during the creation of the Dataset, but are only loading the paths and assigning the transformations in the __init__ method. While iterating the Dataset directly or via a DataLoader, the Dataset.__getitem__ method is used to load and process each sample, which is referred to as lazy loading.
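A minimal sketch of that pattern (the file list, tokenizer, and masking transform below are placeholders, not from the linked tutorial):

```python
from torch.utils.data import Dataset


class LazyTextDataset(Dataset):
    def __init__(self, file_paths, tokenizer, transform=None):
        # __init__ only records where the data lives and which transforms to
        # apply; no sample is read or tokenized here
        self.file_paths = list(file_paths)
        self.tokenizer = tokenizer
        self.transform = transform

    def __len__(self):
        return len(self.file_paths)

    def __getitem__(self, idx):
        # a sample is loaded and processed only when it is requested,
        # i.e. while iterating the Dataset directly or via a DataLoader
        with open(self.file_paths[idx], encoding="utf-8") as f:
            text = f.read()
        sample = self.tokenizer(text)
        if self.transform is not None:
            sample = self.transform(sample)  # e.g. apply the MLM masking here
        return sample
```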