DistributedSampler is NOT required for FSDP on a single multi-GPU node

I understand the need for a DistributedSampler in conjunction with FSDP for multi-node training: the state of the dataset is not shared across instances, so without it each instance would inevitably yield the same data points.

I also understand the need for a DistributedSampler in conjunction with DDP, since its benefit comes from each replica processing a different batch in parallel and aggregating the resulting gradients.
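For reference, the standard map-style pipeline I have in mind looks roughly like this (a minimal sketch, assuming the process group has already been initialized, e.g. by torchrun):

```python
import torch
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

# Toy map-style dataset standing in for the real training set.
dataset = TensorDataset(torch.arange(1000, dtype=torch.float32).unsqueeze(1))

# DistributedSampler reads rank/world size from the default process group
# (init_process_group must already have been called) and assigns each rank
# a disjoint subset of the indices.
sampler = DistributedSampler(dataset, shuffle=True)
loader = DataLoader(dataset, batch_size=32, sampler=sampler)

for epoch in range(3):
    sampler.set_epoch(epoch)  # reshuffle consistently across ranks each epoch
    for (batch,) in loader:
        pass  # forward/backward of the DDP- or FSDP-wrapped model goes here
```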

However, I do not think a DistributedSampler is necessary when using an iterable dataset on a single multi-GPU node.

Correct me if I am wrong.

I have since performed experiments showing that the data sharding a distributed sampler performs is indeed required for FSDP even on a single multi-GPU node; otherwise every rank trains on the same batches.

The reason for this investigation is that my use case requires an iterable dataset (rather than a map-style one), and the PyTorch DistributedSampler is not compatible with iterable datasets. I have therefore created an iterable dataset that borrows the DistributedSampler logic and distributes non-duplicated data across the ranks itself.
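Roughly, the idea looks like the sketch below. This is a simplified illustration rather than my exact implementation (the class name and striding scheme are just one way to replicate what DistributedSampler does); it shards by rank and also by DataLoader worker so that no sample is yielded twice:

```python
import torch.distributed as dist
from torch.utils.data import IterableDataset, get_worker_info


class RankShardedIterableDataset(IterableDataset):
    """Iterable dataset that replicates the DistributedSampler striding logic.

    Each rank (and each DataLoader worker within a rank) yields a disjoint
    stride of the underlying source, so no sample is duplicated across GPUs.
    """

    def __init__(self, source):
        # `source` is any re-iterable sequence of samples, e.g. a list of
        # file paths.
        self.source = source

    def __iter__(self):
        rank = dist.get_rank() if dist.is_initialized() else 0
        world_size = dist.get_world_size() if dist.is_initialized() else 1

        # Also shard across DataLoader workers inside each rank.
        info = get_worker_info()
        worker_id = info.id if info is not None else 0
        num_workers = info.num_workers if info is not None else 1

        shard_id = rank * num_workers + worker_id
        num_shards = world_size * num_workers

        for idx, sample in enumerate(self.source):
            if idx % num_shards == shard_id:
                yield sample
```

Each rank then builds a plain DataLoader over this dataset with no sampler argument. One remaining caveat: if the number of samples is not divisible by the number of shards, ranks end up with slightly different numbers of batches, which DistributedSampler normally avoids by padding or dropping the tail, so that case still has to be handled.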