I have a question about correctly handling DDP training on spot instances.
When using torch.utils.data.distributed.DistributedSampler(my_dataset),
the sampler defaults to seed=0
for every GPU process.
From the docs:
"This number should be identical across all processes in the distributed group."
The problem with spot instances is that training can be preempted and restarted multiple times. With the default seed value, are we going to repeat the same initial dataset sequence every time the job restarts from the last available checkpoint?
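To frame the question more concretely, here is a minimal sketch of the behavior I am asking about. It only illustrates how DistributedSampler shuffles as a function of set_epoch() and the seed; the num_replicas/rank values are passed explicitly just so no process group is needed, and the toy dataset is my own example:

```python
import torch
from torch.utils.data import TensorDataset
from torch.utils.data.distributed import DistributedSampler

# Toy dataset of 100 items; explicit num_replicas/rank avoid needing
# torch.distributed.init_process_group for this illustration.
dataset = TensorDataset(torch.arange(100))
sampler = DistributedSampler(dataset, num_replicas=2, rank=0,
                             shuffle=True, seed=0)

def order_for_epoch(epoch: int) -> list:
    # set_epoch() mixes the epoch number into the shuffle seed.
    # Without calling it, every epoch (and every restart) replays
    # the exact same permutation.
    sampler.set_epoch(epoch)
    return list(iter(sampler))

# Same epoch -> identical order, so resuming from a checkpoint that
# stores the epoch is deterministic after a preemption.
assert order_for_epoch(3) == order_for_epoch(3)
# Different epochs -> different shuffles.
assert order_for_epoch(3) != order_for_epoch(4)
```

So my understanding is that if the checkpoint stores the current epoch and set_epoch() is called after restoring it, a restart replays the same per-epoch order rather than falling back to the epoch-0 sequence. Is that the right way to handle preemption, or is something more needed?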