What memory is used when downloading data on each rank - multiple GPUs

Typical DDP setups use a DistributedSampler, which splits the dataset among the ranks so each rank loads a disjoint shard instead of duplicating the same samples. Are you already using such a sampler?
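
A minimal sketch of what this looks like: `num_replicas` and `rank` are passed explicitly here so it runs without an initialized process group (in a real DDP job they are inferred from `torch.distributed`), and the tiny `TensorDataset` is just a stand-in for your dataset.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, DistributedSampler

# Toy dataset of 8 samples; in a real DDP job every rank constructs the same dataset.
dataset = TensorDataset(torch.arange(8))

# Normally num_replicas/rank come from the initialized process group;
# passed explicitly here so this sketch runs standalone.
sampler = DistributedSampler(dataset, num_replicas=2, rank=0, shuffle=False)
loader = DataLoader(dataset, batch_size=2, sampler=sampler)

# Call set_epoch each epoch so shuffling differs between epochs when shuffle=True.
sampler.set_epoch(0)

for (batch,) in loader:
    print(batch.tolist())  # rank 0 only sees its shard: indices 0, 2, 4, 6
```

With 2 replicas and `shuffle=False`, rank 0 gets indices `[0, 2, 4, 6]` and rank 1 would get `[1, 3, 5, 7]`, so each sample is loaded on exactly one rank per epoch.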