I’ve seen various DistributedDataParallel examples: some use DistributedSampler and also call sampler.set_epoch(epoch) every epoch in the training loop, while others skip this entirely. Why is that, and is it really needed for distributed training to work correctly?
Based on the docs, it’s necessary to call set_epoch to guarantee a different shuffling order each epoch:
In distributed mode, calling the set_epoch() method at the beginning of each epoch before creating the DataLoader iterator is necessary to make shuffling work properly across multiple epochs. Otherwise, the same ordering will be always used.
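A minimal sketch of how this usually looks in a training loop (it assumes the process group has already been initialized via torch.distributed.init_process_group, and uses a placeholder dataset just for illustration):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

# Placeholder dataset; in practice this would be your real dataset.
dataset = TensorDataset(torch.randn(1000, 10), torch.randint(0, 2, (1000,)))

# DistributedSampler picks up rank/world size from the initialized
# process group and gives each process its own shard of the data.
sampler = DistributedSampler(dataset, shuffle=True)
loader = DataLoader(dataset, batch_size=32, sampler=sampler)

for epoch in range(10):
    # Without this call, the sampler seeds its shuffle with the same
    # value every epoch, so every epoch sees the same ordering.
    sampler.set_epoch(epoch)
    for data, target in loader:
        ...  # forward / backward / optimizer step as usual
```

So training still runs without set_epoch, but each epoch (and each process) keeps seeing the data in the same shuffled order, which is usually not what you want.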