DataLoader state recovery and questions about worker_init_fn

Hi,

I have a few questions about the DataLoader that I hope you can help me clarify. The first one is how to recover the state of the DataLoader when shuffle=True. Imagine we save the state of the whole training process at the end of each epoch. If something unexpected happens to the machine, we want to restart training from a previously saved epoch. How can we recover the state of the DataLoader? The only way I can think of right now is, at the moment of resuming, to iterate over the DataLoader once for every epoch that had already been completed in that checkpoint. For example, if we saved after epoch 2, we would iterate over the DataLoader two times so that we start epoch 3 with the same state. I am assuming here that we save the seed and set it with torch.manual_seed. Is there another way to achieve this?
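
To make it concrete, this is roughly the resume procedure I have in mind. It is only a sketch: saved_seed and completed_epochs stand for values read back from my checkpoint, and the TensorDataset is just a stand-in for the real dataset.

    import torch
    from torch.utils.data import DataLoader, TensorDataset

    saved_seed, completed_epochs = 1234, 2       # values read back from the checkpoint
    dataset = TensorDataset(torch.arange(100))   # stand-in for the real dataset

    torch.manual_seed(saved_seed)                # same seed as the original run
    loader = DataLoader(dataset, batch_size=32, shuffle=True)

    # "burn" the epochs that were already completed so the random shuffling
    # reaches the same state it had when the checkpoint was written
    for _ in range(completed_epochs):
        for _ in loader:
            pass

My worry is that replaying many epochs like this could become slow for a large dataset.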

The second question is about worker_init_fn, and here I have several sub-questions. The first one concerns the base seed that is used: if I do not pass any worker_init_fn and I set a seed in my main program with torch.manual_seed, is the "base seed" the workers receive derived from that torch.manual_seed call? Moving on to the next point: imagine that inside the loader I am doing some operations with numpy and I would like to be able to reproduce the experiment, or to continue after an interruption. To achieve this behavior I suppose I need to write my own init_fn, for instance this one:

    import numpy as np

    def _init_fn(worker_id, seed=0):
        np.random.seed(seed + worker_id)     # numpy RNG used inside the dataset
        torch.manual_seed(seed + worker_id)  # torch RNG in this worker
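
I would then plug it into the DataLoader more or less like this. As far as I understand the DataLoader calls worker_init_fn with the worker id as its only argument, so I would bind my base seed with functools.partial (saved_seed is again the seed stored in my checkpoint):

    from functools import partial

    loader = DataLoader(dataset, batch_size=32, shuffle=True, num_workers=4,
                        worker_init_fn=partial(_init_fn, seed=saved_seed))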

As far as I understand, when I use my own init_fn I need to set the seed for the torch API again, right? Because the default initialization will be overridden by my own function. With this function I should be able to set the seed in each worker and therefore reproduce the experiment. My first question about this function concerns worker_id: how do I obtain it, i.e. is it really passed automatically by the DataLoader to my function? The second question is more about how the workers behave. As far as I understand, the workers are restarted at every epoch, so the init_fn is executed again, with the problem that the same random numbers will be generated again, right? How can I seed the workers only once, so that I obtain different numbers in each epoch while still being able to reproduce the run?
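
The only workaround I can think of is to rebuild the init_fn at every epoch and fold the epoch number into the seed, something along these lines. make_init_fn is just a helper of mine, not a PyTorch API, and I am not sure this is the recommended approach:

    def make_init_fn(base_seed, epoch):
        def _epoch_init_fn(worker_id):
            # different seed per worker and per epoch, but still reproducible
            s = base_seed + 1000 * epoch + worker_id
            np.random.seed(s)
            torch.manual_seed(s)
        return _epoch_init_fn

    for epoch in range(10):  # number of epochs
        loader = DataLoader(dataset, batch_size=32, shuffle=True, num_workers=4,
                            worker_init_fn=make_init_fn(saved_seed, epoch))
        for batch in loader:
            pass  # training step would go here

Is something like this really necessary, or does the DataLoader already take care of it?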

Thanks for your time.