As I understand the (somewhat obfuscated) logic of `DataLoader` iterators, and specifically what it means for `ShufflerIterDataPipe` if I have one in my data pipeline:
Creating a new iterator on the `DataLoader` draws a new random number from the current default Torch RNG and uses it as the seed for all data pipes. This seed is broadcast from rank 0 to all other ranks, i.e. all ranks use the same seed. It then calls `torch.utils.data.graph_settings.apply_random_seed`, which calls `set_seed` on all random data pipes (i.e. also on my `ShufflerIterDataPipe`).
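In rough pseudocode, my mental model of that step is something like the following (this is just my reading, not the actual implementation; `IterableWrapper(...).shuffle()` is only a stand-in for my real pipeline):

```python
import torch
import torch.utils.data.graph_settings as graph_settings
from torch.utils.data.datapipes.iter import IterableWrapper

# stand-in for my actual pipeline, ending in a ShufflerIterDataPipe
datapipe = IterableWrapper(range(100)).shuffle()

# DataLoader (as I read it) draws a fresh seed from the default Torch RNG
# on rank 0 and broadcasts it, so all ranks end up with the same value
shared_seed = int(torch.empty((), dtype=torch.int64).random_().item())

# ... and then pushes it into every "random" datapipe in the graph,
# which calls set_seed on my ShufflerIterDataPipe as well
rng = torch.Generator()
rng.manual_seed(shared_seed)
graph_settings.apply_random_seed(datapipe, rng)
```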
Calling `__iter__` on any of the data pipes will then call `reset()` (via a rather obfuscated path…), and that is where the seed is actually used to (re)initialize any internal RNGs. This is also what `ShufflerIterDataPipe` does.
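Standalone (without a `DataLoader`) that behavior seems to be visible like this, if I am reading it right:

```python
from torch.utils.data.datapipes.iter import IterableWrapper

dp = IterableWrapper(range(10)).shuffle()  # ShufflerIterDataPipe

dp.set_seed(0)
first = list(dp)   # __iter__ -> reset() re-seeds the internal RNG from the seed

dp.set_seed(0)
second = list(dp)  # same seed set again -> same shuffled order (as far as I can tell)

assert first == second
```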
However, what I actually want is a different random seed per worker, or more generally, more control over the random seed, e.g. making it also depend on the current epoch, or similar. How would I do that?
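Conceptually, what I am after is something in this direction, e.g. calling `apply_random_seed` myself with an epoch- and rank-dependent generator before each epoch (the helper below is hypothetical, just to illustrate; I suspect the `DataLoader` would re-seed everything again as soon as a new iterator is created, which is exactly my problem):

```python
import torch
import torch.utils.data.graph_settings as graph_settings

def reseed_pipeline(datapipe, base_seed: int, epoch: int, rank: int) -> None:
    # hypothetical helper: derive a seed per epoch and per rank and push it
    # into all random datapipes in the graph (ShufflerIterDataPipe included)
    seed = base_seed + 1000 * epoch + rank
    rng = torch.Generator()
    rng.manual_seed(seed)
    graph_settings.apply_random_seed(datapipe, rng)

# per epoch, before creating a new DataLoader iterator, something like:
# reseed_pipeline(my_datapipe, base_seed=1234, epoch=epoch, rank=dist.get_rank())
```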
I also understand that these data pipes are deprecated (probably for good reason, the code is quite difficult to read). So what would be the alternative? I need something like `ShufflerIterDataPipe`.
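To be concrete, the functionality I need is essentially a buffered shuffle over an iterable stream, roughly like the sketch below, but properly integrated with `DataLoader` workers, distributed ranks and seeding:

```python
import random
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")

def shuffle_buffer(stream: Iterable[T], buffer_size: int, seed: int) -> Iterator[T]:
    # bounded buffer: yield a randomly chosen buffered element for each new incoming one
    rng = random.Random(seed)
    buffer: list[T] = []
    for item in stream:
        if len(buffer) < buffer_size:
            buffer.append(item)
            continue
        idx = rng.randrange(buffer_size)
        yield buffer[idx]
        buffer[idx] = item
    # drain the remaining buffer in random order
    rng.shuffle(buffer)
    yield from buffer
```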