I have a question about correctly handling DDP training on spot instances.
When using torch.utils.data.distributed.DistributedSampler(my_dataset),
the sampler defaults to seed=0
for every GPU process.
From the docs:
"This number should be identical across all processes in the distributed group."
The problem with spot instances is that training can be preempted and restarted multiple times. With the default seed value, are we going to repeat the same initial dataset sequence every time the job restarts from the last available checkpoint?
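To frame the question more concretely, here is a minimal sketch of the behavior I am asking about. It only illustrates how DistributedSampler shuffles as a function of set_epoch() and the seed; the num_replicas/rank values are passed explicitly just so no process group is needed, and the toy dataset is my own example:

```python
import torch
from torch.utils.data import TensorDataset
from torch.utils.data.distributed import DistributedSampler

# Toy dataset of 100 items; explicit num_replicas/rank avoid needing
# torch.distributed.init_process_group for this illustration.
dataset = TensorDataset(torch.arange(100))
sampler = DistributedSampler(dataset, num_replicas=2, rank=0,
                             shuffle=True, seed=0)

def order_for_epoch(epoch: int) -> list:
    # set_epoch() mixes the epoch number into the shuffle seed.
    # Without calling it, every epoch (and every restart) replays
    # the exact same permutation.
    sampler.set_epoch(epoch)
    return list(iter(sampler))

# Same epoch -> identical order, so resuming from a checkpoint that
# stores the epoch is deterministic after a preemption.
assert order_for_epoch(3) == order_for_epoch(3)
# Different epochs -> different shuffles.
assert order_for_epoch(3) != order_for_epoch(4)
```

So my understanding is that if the checkpoint stores the current epoch and set_epoch() is called after restoring it, a restart replays the same per-epoch order rather than falling back to the epoch-0 sequence. Is that the right way to handle preemption, or is something more needed?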