As I understand the (somewhat obfuscated) logic of `DataLoader` iterators, and specifically what it means for `ShufflerIterDataPipe` if I have one in my data pipeline:
Creating a new iterator on the `DataLoader` draws a new random number from the current default Torch RNG and uses it as the seed for all data pipes. This seed is broadcast from rank 0 to all other ranks, i.e. all ranks use the same seed. It then calls `torch.utils.data.graph_settings.apply_random_seed`, which calls `set_seed` on all random data pipes (i.e. also on my `ShufflerIterDataPipe`).
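In rough pseudocode, my mental model of that step is something like the following (this is just my reading, not the actual implementation; `IterableWrapper(...).shuffle()` is only a stand-in for my real pipeline):

```python
import torch
import torch.utils.data.graph_settings as graph_settings
from torch.utils.data.datapipes.iter import IterableWrapper

# stand-in for my actual pipeline, ending in a ShufflerIterDataPipe
datapipe = IterableWrapper(range(100)).shuffle()

# DataLoader (as I read it) draws a fresh seed from the default Torch RNG
# on rank 0 and broadcasts it, so all ranks end up with the same value
shared_seed = int(torch.empty((), dtype=torch.int64).random_().item())

# ... and then pushes it into every "random" datapipe in the graph,
# which calls set_seed on my ShufflerIterDataPipe as well
rng = torch.Generator()
rng.manual_seed(shared_seed)
graph_settings.apply_random_seed(datapipe, rng)
```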
Calling `__iter__` on any of the data pipes will then call `reset()` (via a rather obfuscated path…), and that is where the seed is actually used to (re)initialize any internal RNGs. This is also what `ShufflerIterDataPipe` does.
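Standalone (without a `DataLoader`) that behavior seems to be visible like this, if I am reading it right:

```python
from torch.utils.data.datapipes.iter import IterableWrapper

dp = IterableWrapper(range(10)).shuffle()  # ShufflerIterDataPipe

dp.set_seed(0)
first = list(dp)   # __iter__ -> reset() re-seeds the internal RNG from the seed

dp.set_seed(0)
second = list(dp)  # same seed set again -> same shuffled order (as far as I can tell)

assert first == second
```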
However, what I actually want is a different random seed per worker, or more generally, more control over the random seed, e.g. making it also depend on the current epoch, or similar. How would I do that?
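Conceptually, what I am after is something in this direction, e.g. calling `apply_random_seed` myself with an epoch- and rank-dependent generator before each epoch (the helper below is hypothetical, just to illustrate; I suspect the `DataLoader` would re-seed everything again as soon as a new iterator is created, which is exactly my problem):

```python
import torch
import torch.utils.data.graph_settings as graph_settings

def reseed_pipeline(datapipe, base_seed: int, epoch: int, rank: int) -> None:
    # hypothetical helper: derive a seed per epoch and per rank and push it
    # into all random datapipes in the graph (ShufflerIterDataPipe included)
    seed = base_seed + 1000 * epoch + rank
    rng = torch.Generator()
    rng.manual_seed(seed)
    graph_settings.apply_random_seed(datapipe, rng)

# per epoch, before creating a new DataLoader iterator, something like:
# reseed_pipeline(my_datapipe, base_seed=1234, epoch=epoch, rank=dist.get_rank())
```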
I also understand that these data pipes are deprecated (probably for good reason, the code is quite difficult to read). So what would be the alternative? I need something like `ShufflerIterDataPipe`.
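To be concrete, the functionality I need is essentially a buffered shuffle over an iterable stream, roughly like the sketch below, but properly integrated with `DataLoader` workers, distributed ranks and seeding:

```python
import random
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")

def shuffle_buffer(stream: Iterable[T], buffer_size: int, seed: int) -> Iterator[T]:
    # bounded buffer: yield a randomly chosen buffered element for each new incoming one
    rng = random.Random(seed)
    buffer: list[T] = []
    for item in stream:
        if len(buffer) < buffer_size:
            buffer.append(item)
            continue
        idx = rng.randrange(buffer_size)
        yield buffer[idx]
        buffer[idx] = item
    # drain the remaining buffer in random order
    rng.shuffle(buffer)
    yield from buffer
```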