Request to make PyTorch more multithread-friendly - make pseudo-random number generators part of the model

Hello. I would like to ask for a mechanism in PyTorch to control the seed of its random number generators per model. People who use PyTorch from a multi-threaded application can face the problem that, when several threads perform model initialization with randomized schemes (He initialization, LeCun initialization, etc.), all of them draw from a single global random state, which users of the Python API can only access through functions like torch.manual_seed, so concurrent initializations interfere with each other.

I think it would be better to redesign this and make the random generator state part of the model.
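To illustrate the idea, here is a minimal sketch of what generator-as-part-of-the-model could look like, built on PyTorch's existing torch.Generator object (SeededLinear and the He-style init scheme are my own illustration, not an existing API):

import torch
import torch.nn as nn

class SeededLinear(nn.Linear):
    # Hypothetical sketch: the module owns its RNG state and draws all
    # initialization values from it instead of from the global RNG.
    def __init__(self, in_features, out_features, seed):
        # The generator must exist before super().__init__(), because
        # nn.Linear's constructor calls reset_parameters().
        self.generator = torch.Generator()
        self.generator.manual_seed(seed)
        super().__init__(in_features, out_features)

    def reset_parameters(self):
        with torch.no_grad():
            # He-style scaling, drawn from the private generator (the exact
            # init scheme is an assumption chosen for illustration).
            std = (2.0 / self.in_features) ** 0.5
            self.weight.copy_(
                torch.randn(self.weight.shape, generator=self.generator) * std)
            if self.bias is not None:
                self.bias.zero_()

layer = SeededLinear(16, 8, seed=123)  # reproducible regardless of global RNG state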


Here is a code snippet that demonstrates the problem with global state for numpy.random and Python's random:


#!/usr/bin/env python3

import numpy as np
import random
import threading, time

class WorkerThread(threading.Thread):

    def __init__(self, i, sleep_seconds):
        threading.Thread.__init__(self)
        self.th_number = i
        self.sleep_seconds = sleep_seconds

    def run(self):
        # Every thread seeds the *global* generators with the same value...
        np.random.seed(123)
        random.seed(123)

        # ...but the values drawn below depend on how the threads interleave,
        # because all threads share and advance the same global state.
        time.sleep(self.sleep_seconds)
        print(self.th_number, np.random.random(), "(np.random)")
        print(self.th_number, random.random(), "(random)")

th = [WorkerThread(k, 1 * k) for k in range(3)]
for t in th: t.start()
for t in th: t.join()
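By contrast, both libraries already offer instance-based generators whose state is local to the object, which is the kind of design I am asking for at the model level; a minimal sketch for illustration (not part of the demonstration above):

import numpy as np
import random

# Each consumer owns its generator, so seeding is inherently local and
# unaffected by what other threads do to the global state.
rng_np = np.random.default_rng(123)   # numpy Generator instance
rng_py = random.Random(123)           # stdlib Random instance
print(rng_np.random(), rng_py.random())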


This known limitation is caused by forked subprocesses and third-party libraries, as shown in your code snippet and explained in the FAQ as well as the Randomness docs.
While the DataLoader already sets the seed for the random module (as well as for PyTorch), note that numpy is not a requirement, which is why previous suggestions to seed third-party libraries in PyTorch's code were declined.
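If you do rely on numpy inside DataLoader workers, the Randomness docs recommend seeding it yourself via worker_init_fn; a minimal sketch along those lines (the dataset and loader parameters are placeholders):

import random
import numpy as np
import torch
from torch.utils.data import DataLoader, TensorDataset

def seed_worker(worker_id):
    # Derive numpy/random seeds from the per-worker seed that the
    # DataLoader already sets for PyTorch and the random module.
    worker_seed = torch.initial_seed() % 2**32
    np.random.seed(worker_seed)
    random.seed(worker_seed)

loader = DataLoader(TensorDataset(torch.arange(10.0)),
                    batch_size=2, num_workers=2,
                    worker_init_fn=seed_worker)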

Thanks, what you are describing is very interesting, and thanks for the references, but I did not mean DataLoaders or the forking strategy for parallel workers.

I mostly care about the initialization of models and about having the ability to specify a model's initialization in a thread-safe way.

So my suggestion is to add this seed-controlling technique to the public API at the level of models, if that is possible, for cases when people do not use DataLoaders.

Low-level libraries very often expose memory allocation and logging as user-specified callbacks; in my opinion, PyTorch should similarly allow the user to control random seeds, and this is an important thing for me as a user.
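In the meantime, the closest workaround I see is to route all initialization randomness through a private torch.Generator per thread; a sketch under that assumption (build_model is my own helper, not a PyTorch API):

import threading
import torch

def build_model(seed):
    # All weight values come from a private generator, so concurrent
    # threads cannot disturb each other's results. (Constructing nn.Linear
    # still touches the global RNG, but those values are overwritten.)
    g = torch.Generator()
    g.manual_seed(seed)
    model = torch.nn.Linear(4, 2)
    with torch.no_grad():
        model.weight.copy_(torch.randn(model.weight.shape, generator=g))
        model.bias.copy_(torch.randn(model.bias.shape, generator=g))
    return model

results = {}

def worker(i):
    results[i] = build_model(seed=123)

threads = [threading.Thread(target=worker, args=(k,)) for k in range(3)]
for t in threads: t.start()
for t in threads: t.join()

# All three models end up identical, regardless of thread scheduling.
assert all(torch.equal(results[0].weight, results[k].weight) for k in range(3))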