I’m running DDP training on a cluster with a time limit. When the time limit hits, I have to checkpoint the model’s, optimizer’s, etc. states, and resubmit a job that loads these states.

Besides these regular states, if I want the training curve to be exactly the same as if there were no time limit, I also have to save the random generators’ states. I tried the following (the actual code differs, but all key components are included below):

Saving:

```
import random
import numpy
import torch

# Each rank saves the states of all four generators it uses
rng_state_dict = {
    'cpu_rng_state': torch.get_rng_state(),
    'gpu_rng_state': torch.cuda.get_rng_state(),  # state of the current CUDA device
    'numpy_rng_state': numpy.random.get_state(),
    'py_rng_state': random.getstate()
}
torch.save(rng_state_dict, f'rng_state_{rank}.ckpt')
```
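As a sanity check (independent of DDP and of the cluster setup), the save/restore round trip for the host-side generators can be verified in isolation; this minimal sketch uses only the CPU generators, since the CUDA state behaves analogously:

```
import random
import numpy
import torch

# Capture the current state of each host-side generator
state = {
    'cpu_rng_state': torch.get_rng_state(),
    'numpy_rng_state': numpy.random.get_state(),
    'py_rng_state': random.getstate(),
}
a = (torch.rand(3), numpy.random.rand(3), random.random())

# Restore and draw again: the sequences must match exactly
torch.set_rng_state(state['cpu_rng_state'])
numpy.random.set_state(state['numpy_rng_state'])
random.setstate(state['py_rng_state'])
b = (torch.rand(3), numpy.random.rand(3), random.random())

assert torch.equal(a[0], b[0])
assert (a[1] == b[1]).all()
assert a[2] == b[2]
```

If this check passes but the resumed curve still diverges, the discrepancy must come from some other source of randomness.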

Loading:

```
import random
import numpy
import torch

# Assume main.py already knows its local_rank and (global) rank
# At the very beginning of main.py, each rank restores its own states
torch.cuda.set_device(local_rank)
rng_state_dict = torch.load(f'rng_state_{rank}.ckpt', map_location='cpu')
torch.set_rng_state(rng_state_dict['cpu_rng_state'])
torch.cuda.set_rng_state(rng_state_dict['gpu_rng_state'])  # restores the current device's state
numpy.random.set_state(rng_state_dict['numpy_rng_state'])
random.setstate(rng_state_dict['py_rng_state'])
```
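For context, any standalone `torch.Generator` objects (for example, one passed to a `DataLoader` via its `generator=` argument for shuffling) keep their own state, separate from the global generators handled above, and would need the same save/restore treatment. A minimal sketch with hypothetical names:

```
import torch

# Hypothetical dedicated generator, e.g. DataLoader(..., generator=gen)
gen = torch.Generator()
gen.manual_seed(1234)

saved = gen.get_state()                        # checkpoint the generator's state
a = torch.randint(0, 10, (5,), generator=gen)  # draws made after the checkpoint

gen.set_state(saved)                           # restore on resume
b = torch.randint(0, 10, (5,), generator=gen)  # the same draws repeat

assert torch.equal(a, b)
```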

I use the above strategy to save states and resubmit the job to continue training. However, the resumed training curve differs from the case where there is no time limit. I wonder what piece is missing from my code?