I’ve seen various DistributedDataParallel examples: some use DistributedSampler and also call sampler.set_epoch(epoch) every epoch in the training loop, while others skip this entirely. Why is that, and is it really needed for distributed training to work correctly?
Based on the docs, it’s necessary to call set_epoch to guarantee a different shuffling order each epoch:
In distributed mode, calling the set_epoch() method at the beginning of each epoch before creating the DataLoader iterator is necessary to make shuffling work properly across multiple epochs. Otherwise, the same ordering will be always used.
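A minimal sketch of how this usually looks in a training loop (it assumes the process group has already been initialized via torch.distributed.init_process_group, and uses a placeholder dataset just for illustration):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

# Placeholder dataset; in practice this would be your real dataset.
dataset = TensorDataset(torch.randn(1000, 10), torch.randint(0, 2, (1000,)))

# DistributedSampler picks up rank/world size from the initialized
# process group and gives each process its own shard of the data.
sampler = DistributedSampler(dataset, shuffle=True)
loader = DataLoader(dataset, batch_size=32, sampler=sampler)

for epoch in range(10):
    # Without this call, the sampler seeds its shuffle with the same
    # value every epoch, so every epoch sees the same ordering.
    sampler.set_epoch(epoch)
    for data, target in loader:
        ...  # forward / backward / optimizer step as usual
```

So training still runs without set_epoch, but each epoch (and each process) keeps seeing the data in the same shuffled order, which is usually not what you want.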