Efficient way for deterministic data augmentation

What is the most efficient way to get deterministic data augmentation (i.e. the transformations in every epoch are random, yet they can be reproduced reliably for every data point)?

Currently I am thinking of creating a list containing one numpy RandomState object per data point. Even if the DataLoader uses multiple processes, each object is used exactly once per epoch, so every data point is subject to the exact same random transformations when e.g. restarting training from scratch (assuming the RandomState objects are reinitialized with the same seed). A single RandomState is not enough, as multiple processes (num_workers > 0) would access it and the data points are shuffled every epoch.
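A rough sketch of what I mean, with a toy flip augmentation (the dataset class and the augmentation are just placeholders):

```python
import numpy as np
from torch.utils.data import Dataset

class DeterministicAugmentDataset(Dataset):
    """Toy sketch: one RandomState per data point for reproducible augmentation."""
    def __init__(self, data, base_seed=0):
        self.data = data
        # One RandomState per data point; recreating the dataset with the same
        # base_seed reproduces the exact same sequence of transformations.
        self.rngs = [np.random.RandomState(base_seed + i) for i in range(len(data))]

    def __len__(self):
        return len(self.data)

    def __getitem__(self, index):
        rng = self.rngs[index]
        sample = self.data[index]
        # Placeholder augmentation drawing its randomness from the per-sample RNG,
        # e.g. a random horizontal flip.
        if rng.rand() < 0.5:
            sample = np.flip(sample, axis=-1).copy()
        return sample
```

One thing I'm unsure about: with num_workers > 0 each worker gets its own copy of the dataset, so any RandomState advanced inside a worker would not propagate back to the main process.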

Is there a more efficient way to do this, given that multiple processes apply a random transformation to the data points every epoch and their order changes due to shuffling?

I think one good approach would be to use the worker id and seed all third-party libraries with it.
You can get the id via torch.utils.data.get_worker_info().
PyTorch itself should yield deterministic data samples if you've properly set the seed beforehand.
However, if you are using e.g. the Python random library or numpy in the data loading pipeline, you would have to seed them as well using the aforementioned approach.
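Something along these lines (a minimal sketch; seed_worker and the toy TensorDataset are just for illustration, and the per-worker seed is read via get_worker_info()):

```python
import random
import numpy as np
import torch
from torch.utils.data import DataLoader, TensorDataset

def seed_worker(worker_id):
    # Seed Python's random and numpy in each worker process from the
    # per-worker seed PyTorch assigns (accessible via get_worker_info()).
    worker_seed = torch.utils.data.get_worker_info().seed % 2**32
    np.random.seed(worker_seed)
    random.seed(worker_seed)

torch.manual_seed(0)  # makes PyTorch's own random ops deterministic

g = torch.Generator()
g.manual_seed(0)      # fixes the shuffling order of the DataLoader

dataset = TensorDataset(torch.randn(100, 3))  # toy dataset for illustration
loader = DataLoader(
    dataset,
    batch_size=32,
    shuffle=True,
    num_workers=4,
    worker_init_fn=seed_worker,
    generator=g,
)
```

Restarting the script with the same seeds should then reproduce both the shuffling order and the random numbers drawn by random/numpy inside the workers.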