I understand the need for a DistributedSampler in conjunction with FSDP for multi-node training: the state of the dataset is not shared across instances, so without one, every rank would inevitably yield the same data points.
I also understand the need for a DistributedSampler in conjunction with DDP, since DDP's benefit comes from processing different batches in parallel and aggregating the results.
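For reference, this is the kind of setup I mean. It is only a minimal sketch with a toy dataset and placeholder hyperparameters; the FSDP case would just swap the DDP wrapper for FSDP:

```python
# Minimal sketch of what I mean; the toy dataset, model, and hyperparameters
# are placeholders, not from a real project.
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

def main() -> None:
    dist.init_process_group("nccl")
    rank = dist.get_rank()
    torch.cuda.set_device(rank)

    model = torch.nn.Linear(10, 1).cuda(rank)
    ddp_model = DDP(model, device_ids=[rank])

    dataset = TensorDataset(torch.randn(1024, 10), torch.randn(1024, 1))
    # DistributedSampler gives each rank a disjoint shard of the indices,
    # so the ranks collectively cover the dataset without duplication.
    sampler = DistributedSampler(dataset)
    loader = DataLoader(dataset, batch_size=32, sampler=sampler)

    loss_fn = torch.nn.MSELoss()
    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=1e-3)

    for epoch in range(2):
        sampler.set_epoch(epoch)  # reshuffle the shards each epoch
        for x, y in loader:
            x, y = x.cuda(rank), y.cuda(rank)
            optimizer.zero_grad()
            loss = loss_fn(ddp_model(x), y)
            loss.backward()  # DDP all-reduces gradients across ranks here
            optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```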
However, I do not think a DistributedSampler is necessary with iterable datasets on a single multi-GPU machine.
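To be concrete, something like the following is what I have in mind: the iterable dataset shards the stream itself by rank, and no sampler is passed to the DataLoader at all. This is only a rough sketch; the round-robin skip is just one possible scheme, and it assumes the process group has already been initialized:

```python
# Rough sketch of what I mean by "no DistributedSampler": the iterable
# dataset shards the stream itself using the process rank. Assumes
# torch.distributed has already been initialized (e.g. via torchrun).
import torch.distributed as dist
from torch.utils.data import DataLoader, IterableDataset

class ShardedStream(IterableDataset):
    def __init__(self, source):
        self.source = source  # any iterable of samples

    def __iter__(self):
        rank = dist.get_rank()
        world_size = dist.get_world_size()
        # Each rank keeps every world_size-th sample, so together the
        # ranks cover the stream exactly once with no duplication.
        for i, sample in enumerate(self.source):
            if i % world_size == rank:
                yield sample

# No sampler argument; the sharding already happened inside the dataset.
loader = DataLoader(ShardedStream(range(1000)), batch_size=32)
```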
Correct me if I am wrong.