What happens when we do not give a distributed sampler? Does it essentially iterate over all samples with as many ranks as we have?
If the data is not sharded across the DDP ranks (e.g. with a `DistributedSampler`, or some custom sharding logic of your own), then yes, every rank iterates over all samples (in your example I guess there are 2 ranks).
This is why in general you want to partition your data appropriately across ranks, so that different model replicas see different data.
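As a quick sketch of what the sampler does: `DistributedSampler` normally reads the rank and world size from the initialized process group, but you can pass `num_replicas` and `rank` explicitly to see the sharding without spawning processes. (The dataset and the 2-rank setup here are just illustrative.)

```python
import torch
from torch.utils.data import TensorDataset, DataLoader
from torch.utils.data.distributed import DistributedSampler

# Toy dataset of 8 samples.
dataset = TensorDataset(torch.arange(8))

# One sampler per rank; with shuffle=False the shards are easy to inspect.
for rank in range(2):
    sampler = DistributedSampler(dataset, num_replicas=2, rank=rank, shuffle=False)
    loader = DataLoader(dataset, sampler=sampler, batch_size=2)
    print(f"rank {rank} sees indices: {list(sampler)}")
```

Each rank gets a disjoint slice of the indices, so together the ranks cover the dataset exactly once per epoch. In real DDP training you would also call `sampler.set_epoch(epoch)` at the start of each epoch so the shuffling differs between epochs.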
Thanks Rohan, this cleared it up for me.