Fix seed for data loader

I am a little bit confused about the DataLoader and the number of workers. Let's say I have 100 images in my dataset and use shuffle=True. I run the code for 20 epochs, and in each epoch the 100 images are randomly shuffled.

(1) If I run the code again for 20 epochs, how can I make it follow the same shuffling as the previous run?
(2) Let's say num_workers=2 and batch_size=8. Does this mean each worker contributes 4 images to a batch?
(3) If batch_size=1 and num_workers=2, is only one worker actually required since batch_size=1? If not, how is the data processed?
(4) If num_workers=0, the code works fine. What does it actually contribute?

  1. You could reset the seed via torch.manual_seed either before starting the new epoch, or probably also by recreating the DataLoaders and seeding the workers in the worker_init_fn (see the sketch after this list).
  2. Each worker will load the entire batch of 8 samples.
  3. I don’t understand this question so could you explain it a bit more?
  4. num_workers=0 will load the data in the main process and will not spawn background processes.
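A minimal sketch for point 1, using a toy TensorDataset as a stand-in for your 100 images: setting the same global seed before creating/iterating the DataLoader reproduces the shuffle order across runs, and the worker_init_fn re-seeds NumPy and Python's random module inside each worker so that random augmentations are reproducible as well (PyTorch already seeds its own RNG per worker automatically).

import random
import numpy as np
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(100, 3, 32, 32))  # stand-in for your 100 images

def seed_worker(worker_id):
    # seed NumPy and Python's random in each worker process
    worker_seed = torch.initial_seed() % 2**32
    np.random.seed(worker_seed)
    random.seed(worker_seed)

torch.manual_seed(0)  # same seed in every run -> same shuffle order in every run
loader = DataLoader(dataset, batch_size=8, shuffle=True,
                    num_workers=2, worker_init_fn=seed_worker)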

Thanks. If I re-run for 20 epochs, it shuffles as it did in the first run. Now consider two cases.
Case 1: The model runs for a total of 20 epochs.
Case 2: Let's say training stops at epoch 6 and the model was saved at epoch 5. When I load the model saved at epoch 5 and continue training, epoch 6 follows the shuffling of epoch 1, epoch 7 follows the shuffling of epoch 2, and so on. It does not follow the shuffling of epoch 6 as in Case 1. I need the same shuffling as in Case 1 even if I continue training from the saved model.

If I am not wrong, num_workers initiates background processes that keep batches of the specified batch_size ready, to speed up loading. Please correct me: if only one batch at a time is fed to the model on a single GPU, there is no need to specify more than one worker. Is that right?

I think your answer 2 also applies to question 3: each worker creates a batch of batch_size=1.

To be able to “reset” the seed, the original training run could already set the seed in each epoch, so that you can just reuse the same workflow. I.e. training in epoch 0 uses seed=0, training in epoch 1 uses seed=1, etc.
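A rough sketch of that workflow (start_epoch, num_epoch, and loader are placeholders for your own variables): re-seeding at the start of every epoch makes the shuffle order depend only on the epoch index, so resuming at epoch 6 calls torch.manual_seed(6) exactly like the uninterrupted run would.

import torch

for epoch in range(start_epoch, num_epoch):
    torch.manual_seed(epoch)  # seed == epoch index: seed=0 in epoch 0, seed=1 in epoch 1, ...
    for i, data in enumerate(loader):
        ...  # training step; the shuffle for this epoch is drawn after the re-seed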

Depending on your system, it could be beneficial to create more worker processes. Even if a single batch is already prepared, the other workers would still preload the next batches and add them to the queue. You can experiment with different numbers of workers and see where the sweet spot for your system and workload is.
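One simple way to look for that sweet spot is to time a full pass over the same DataLoader for a few num_workers values; the toy TensorDataset here is just a stand-in for your real dataset.

import time
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(1000, 3, 224, 224), torch.randint(0, 10, (1000,)))

for num_workers in (0, 1, 2, 4):
    loader = DataLoader(dataset, batch_size=8, shuffle=True, num_workers=num_workers)
    start = time.time()
    for _ in loader:  # iterate once over the whole dataset without training
        pass
    print(f"num_workers={num_workers}: {time.time() - start:.2f} s")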

If batch_size=1 is used in the DataLoader, then yes.
Otherwise each worker will create an entire batch (i.e. it will load and process batch_size samples), as the sketch below shows.
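A minimal sketch to see this: a toy Dataset that prints which worker loaded each sample. With batch_size=8 and num_workers=2, all samples of one batch come from the same worker, i.e. the workers alternate whole batches instead of splitting a batch between them.

import torch
from torch.utils.data import Dataset, DataLoader, get_worker_info

class DummyDataset(Dataset):
    def __len__(self):
        return 16

    def __getitem__(self, index):
        info = get_worker_info()
        worker = info.id if info is not None else "main"
        print(f"sample {index} loaded by worker {worker}")
        return torch.tensor(index)

if __name__ == "__main__":  # guard needed when workers are started via spawn (e.g. on Windows)
    loader = DataLoader(DummyDataset(), batch_size=8, num_workers=2)
    for batch in loader:
        pass
    # samples 0-7 are loaded by one worker and samples 8-15 by the other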

The question is: when I run the model for 20 epochs (training does not stop in between), it shuffles the images in each epoch. I want the same shuffling if training stops and I load the saved model of a previous epoch. The problem comes when I load the saved model and resume training. Let's say training stops after saving the model of epoch 5.

Now, if I load the saved model and start training from epoch 6, the shuffling is as in epoch 1, epoch 7 follows the shuffling of epoch 2, and so on. What I want is the shuffling of epoch 6 as if I had run the model for 20 epochs without interruption. In short, if I start training after loading a saved model, the seed is reset even though I start from epoch 6. I hope the question is clear now.

I understand the question and don’t think there is a way other than to:

  • already seed the data loading during training and just do the same when resuming the training, or
  • rerun the training.

Even if you already set a global seed during the original training, one might think you could just iterate the DataLoader for x epochs before continuing the training in epoch x+1. However, while this approach might work if random operations are only used in the data loading, all calls to the PRNG in the model (e.g. in dropout layers) would be missed and you thus won’t get the same results.
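A minimal sketch of why the fast-forward alone is not enough when the model uses the PRNG: the dropout call in the original run advances the global RNG state, so a second run that replays only the data-loading side diverges from epoch 2 onwards.

import torch

torch.manual_seed(0)
perm_epoch1 = torch.randperm(100)             # stands in for the shuffle of epoch 1
_ = torch.nn.Dropout(p=0.5)(torch.ones(100))  # stands in for random ops inside the model
perm_epoch2 = torch.randperm(100)             # shuffle of epoch 2

torch.manual_seed(0)
perm_epoch1_ff = torch.randperm(100)  # "fast-forward": only the data loading is replayed
perm_epoch2_ff = torch.randperm(100)  # the dropout call is missing here

print(torch.equal(perm_epoch1, perm_epoch1_ff))  # True
print(torch.equal(perm_epoch2, perm_epoch2_ff))  # False: the shuffle order has diverged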

Thanks. I think I am close to what I want. Can you tell me where I can plug code 1 into my code 2 below?

Code 1: 
for _ in range(epochs_to_restore):
    for batch in loader:
        pass

Code 2: 
for epoch in range(0, num_epoch):
    for i, data in enumerate(dataset):
        ...  # training step

You can run code 1 right before starting the training in code 2, if you’ve set the global seed in both use cases at least once or if your model does not contain any random ops.
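Putting that together, a sketch of the resumed run could look like this; epochs_to_restore, num_epoch, and the DataLoader (called loader here, corresponding to the object you iterate over in code 2) are placeholders for your own variables.

import torch

torch.manual_seed(0)  # same global seed that was set in the original, uninterrupted run

# code 1: fast-forward the DataLoader through the epochs that already finished
for _ in range(epochs_to_restore):
    for batch in loader:
        pass

# code 2: resume the actual training; the shuffle continues where the original run left off
# (only exact if the model itself contains no random ops, as discussed above)
for epoch in range(epochs_to_restore, num_epoch):
    for i, data in enumerate(loader):
        ...  # training step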
