I’ve seen various examples using DistributedDataParallel: some implement DistributedSampler and also call sampler.set_epoch(epoch) at the start of every epoch in the train loop, while others skip this entirely. Why is that, and is set_epoch really needed for distributed training to work correctly?
Based on the docs it’s necessary to use `set_epoch` to guarantee a different shuffling order:

> In distributed mode, calling the `set_epoch()` method at the beginning of each epoch before creating the `DataLoader` iterator is necessary to make shuffling work properly across multiple epochs. Otherwise, the same ordering will be always used.
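You can see this behavior without launching any processes, since `DistributedSampler` accepts `num_replicas` and `rank` explicitly (a small sketch; the dataset and values here are made up for illustration):

```python
import torch
from torch.utils.data import TensorDataset
from torch.utils.data.distributed import DistributedSampler

dataset = TensorDataset(torch.arange(100))
# Passing num_replicas/rank explicitly avoids needing an initialized
# process group for this demo.
sampler = DistributedSampler(dataset, num_replicas=2, rank=0, shuffle=True)

# Without set_epoch, every pass reseeds with the same (seed + epoch),
# so each "epoch" yields the identical ordering.
first = list(sampler)
second = list(sampler)
assert first == second

# Calling set_epoch changes the seed used for shuffling, so the next
# pass produces a different ordering.
sampler.set_epoch(1)
third = list(sampler)
print(first == third)
```

So training still runs without `set_epoch`, but each epoch sees the samples in the same order, which can hurt convergence compared to proper per-epoch shuffling.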