Reproducibility over multiple GPUs is impossible until the randomness of threads is controlled, and yet ...

Has anyone succeeded in reproducing their results when using multiple GPUs?
If yes, could you share how you did it? (general idea)

My code is fully reproducible when using a single GPU (independently of the number of workers >= 0); however, it loses its reproducibility when using multiple GPUs.
Randomness applied to the samples (such as transformations) is controlled (fixed using a seed per sample).
I use torch.nn.DataParallel to wrap the model for multi-GPU training.

What is to blame for this non-reproducibility in the multi-GPU case? Atomic operations (I hope not)?

PyTorch reproducibility note.
I use PyTorch 1.0.0 and Python 3.7.0.

I use the standard recipe to fix the seeds of the relevant modules:

torch.manual_seed(seed)
np.random.seed(seed)
random.seed(seed)
torch.cuda.manual_seed(seed)
torch.cuda.manual_seed_all(seed)  # for multiGPUs.
torch.backends.cudnn.benchmark = False
torch.backends.cudnn.deterministic = True

Thanks!

Any news on this?
Thank you!

+++++++++++++++++++++++++++++++++++++++++++++++++++++++

                      PAGE 1/5

+++++++++++++++++++++++++++++++++++++++++++++++++++++++

Code tested using PyTorch 1.0.0 / Python 3.7.0, over K=2 GPUs.

GIST-of-GIST:

  1. Depending on what your code is doing, it is possible to make the code reproducible over multiple GPUs for a fixed number of GPUs K (i.e., results are reproducible only for that particular K).
  2. Controlling the randomness of each thread over each GPU allows reproducibility in the sense of (1) (conceptually, and on a synthetic example). However, on real code I was unable to obtain stable/reproducible results (sometimes I obtain 100% reproducible results, other times not). This must have something to do with the sensitivity of torch.nn.CrossEntropyLoss to the instability/randomness of F.interpolate in my code, as in this issue.
  3. In order to control randomness in threads when using torch.nn.DataParallel, we propose to use:
    a. a threading.Lock to make the random regions within each thread thread-safe (threads share the random generators).
    b. re-seeding each thread separately. One can go further and pass around a PRNG state per thread.
  4. I spent a lot of time on this issue. I have decided to put the matter to bed for now. I hope that the dev team will equip PyTorch with simpler and more efficient tools for reproducibility over single and multiple GPUs.
  5. Related: 1, 2.
  6. You may consider disabling cuDNN if you are unable to get reproducible results: torch.backends.cudnn.enabled = False.

GIST:

After some digging, I came to this:

  1. If the forward function (or any other function) that you parallelized using torch.nn.DataParallel contains random instructions such as dropout, reproducibility over multiple GPUs in PyTorch (1.0.0)/Python 3.7.0 is impossible for an arbitrary number of GPUs K (in the sense that you cannot obtain the same results for K=1, AND K=2, AND K=3, AND K=4, …).
  2. Why? Because of multithreading, which is non-deterministic by definition. torch.nn.parallel.parallel_apply.py uses threads. Threads share memory between them (which includes the random generator). Do you see the problem now? In our case, the threads truly run in parallel, since each one runs on its own GPU device (assuming there are only GPU instructions and no CPU instructions). You cannot know with certainty the order in which the threads execute; it is up to the OS kernel's scheduler at runtime. Therefore, you cannot determine the order of the threads' calls to the random generator, and each call changes the generator's internal state. As a consequence, you cannot determine in which state thread i will find the random generator when it calls it.
  3. The good news is that you can make your code reproducible for K GPUs, in the sense that the results obtained with K GPUs are reproducible ONLY when using K GPUs.
  4. Reproducibility in PyTorch still needs a lot of work.
  5. The nightly build version 1.2.0.dev20190616 (https://download.pytorch.org/whl/nightly/cu100/torch_nightly-1.2.0.dev20190616-cp37-cp37m-linux_x86_64.whl) seems to have fixed a major glitch in the RNG (merge).

I am not an expert in multithreading; I had to spend some time experimenting and testing to confirm the logic above. What is known about threads is that they share memory, and each thread has its own stack (I am not sure exactly what goes onto the stack). The random generators are shared among the threads.
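To see the shared-generator effect without any GPU, here is a minimal CPU-only sketch (my own, not part of the experiments below) using plain Python threads: both threads draw from torch's single global CPU generator, so which thread gets which number depends on scheduling.

import threading

import torch

torch.manual_seed(0)
results = {}


def worker(idx):
    # Both threads read and advance the SAME global CPU generator state.
    results[idx] = torch.rand(1).item()


threads = [threading.Thread(target=worker, args=(i,)) for i in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# The two drawn numbers are always the same pair, but which thread got which
# number can change between runs, depending on the OS scheduler.
print(results)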

Here is some code to support the above. The batch size is 8, so each GPU processes 4 samples. The code has only one random instruction, which happens in the forward, and we perform only one call to the forward. The code takes an input x and adds a random value to it element-wise.

Code 1: threads share the random generator.

This code has exactly two possible random output states. The output state depends on which of the two threads reaches the random generator FIRST. This race does not depend on you, your code, or how you designed it; it is up to the OS kernel scheduler. In every run, you may get different results depending on the order in which the threads were executed. Since we have only two threads and only one random instruction, one call to the forward function can generate only two outcome states. We show the state of the random generator by displaying its signature before and after calling it. We assume that calling our random generator is thread-safe (i.e., nothing goes wrong if both threads call the random instruction at the same time).

import random

import numpy as np

import torch
import torch.nn as nn
from torch.nn import DataParallel


def set_seed(seed):
    """
    Fix the seed for some modules.
    :param seed: int. The seed.
    :return:
    """
    torch.manual_seed(seed)
    np.random.seed(seed)
    random.seed(seed)
    torch.cuda.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.benchmark = False
    torch.backends.cudnn.deterministic = True


class Model(nn.Module):

    def __init__(self):
        super(Model, self).__init__()

    def forward(self, x):
        if x.is_cuda:
            device = x.get_device()
            print("DEVICE: {}".format(x.get_device()))
            print("x.size() = {}".format(x.size()))
            print("x:\n {}".format(x))
        else:
            device = torch.device("cpu")

        prngs0 = torch.random.get_rng_state().type(torch.float).numpy()
        # computing sum(abs(diff)**2) is a better way to have a summary of the PRNGs instead of computing only the sum.
        # Two different PRNGs may have the same sum (may happen if we generate few random numbers).
        print("DEVICE {} PRNG STATUS BEFORE: \n {}".format(torch.cuda.current_device(),
                                                           np.abs(np.diff(prngs0)**2).sum()))

        delta = torch.rand(x.size()).to(device)
        prngs1 = torch.random.get_rng_state().type(torch.float).numpy()
        print("DEVICE {} PRNG STATUS AFTER: \n {}".format(torch.cuda.current_device(),
                                                          np.abs(np.diff(prngs1)**2).sum()))

        print("DEVICE {} DIFF PRNG STATUS ABS(BEFORE - AFTER).SUM(): \n {}".format(
            torch.cuda.current_device(),   np.abs(prngs0 - prngs1).sum()))
        print("Delta:\n {}".format(delta))
        x = x + delta
        return x


if __name__ == "__main__":
    set_seed(0)
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    model = Model()
    model = DataParallel(model)
    model.to(device)

    x = torch.rand(8, 3)
    x = x.to(device)
    print("X in:\n {}".format(x))
    print("X out:\n {}".format(model(x)))

run 1:

X in:
 tensor([[0.4963, 0.7682, 0.0885],
        [0.1320, 0.3074, 0.6341],
        [0.4901, 0.8964, 0.4556],
        [0.6323, 0.3489, 0.4017],
        [0.0223, 0.1689, 0.2939],
        [0.5185, 0.6977, 0.8000],
        [0.1610, 0.2823, 0.6816],
        [0.9152, 0.3971, 0.8742]], device='cuda:0')
DEVICE: 0
x.size() = torch.Size([4, 3])
DEVICE: 1
x.size() = torch.Size([4, 3])
x:
 tensor([[0.0223, 0.1689, 0.2939],
        [0.5185, 0.6977, 0.8000],
        [0.1610, 0.2823, 0.6816],
        [0.9152, 0.3971, 0.8742]], device='cuda:1')
DEVICE 1 PRNG STATUS BEFORE: 
 46790328.0
DEVICE 1 PRNG STATUS AFTER: 
 46787832.0
DEVICE 1 DIFF PRNG STATUS ABS(BEFORE - AFTER).SUM(): 
 24.0
Delta:
 tensor([[0.4194, 0.5529, 0.9527],
        [0.0362, 0.1852, 0.3734],
        [0.3051, 0.9320, 0.1759],
        [0.2698, 0.1507, 0.0317]], device='cuda:1')
x:
 tensor([[0.4963, 0.7682, 0.0885],
        [0.1320, 0.3074, 0.6341],
        [0.4901, 0.8964, 0.4556],
        [0.6323, 0.3489, 0.4017]], device='cuda:0')
DEVICE 0 PRNG STATUS BEFORE: 
 46787832.0
DEVICE 0 PRNG STATUS AFTER: 
 46786488.0
DEVICE 0 DIFF PRNG STATUS ABS(BEFORE - AFTER).SUM(): 
 24.0
Delta:
 tensor([[0.2081, 0.9298, 0.7231],
        [0.7423, 0.5263, 0.2437],
        [0.5846, 0.0332, 0.1387],
        [0.2422, 0.8155, 0.7932]], device='cuda:0')
X out:
 tensor([[0.7044, 1.6980, 0.8116],
        [0.8744, 0.8337, 0.8777],
        [1.0747, 0.9296, 0.5943],
        [0.8745, 1.1644, 1.1949],
        [0.4417, 0.7218, 1.2466],
        [0.5547, 0.8829, 1.1734],
        [0.4661, 1.2143, 0.8575],
        [1.1850, 0.5478, 0.9059]], device='cuda:0')

run 2:

X in:
 tensor([[0.4963, 0.7682, 0.0885],
        [0.1320, 0.3074, 0.6341],
        [0.4901, 0.8964, 0.4556],
        [0.6323, 0.3489, 0.4017],
        [0.0223, 0.1689, 0.2939],
        [0.5185, 0.6977, 0.8000],
        [0.1610, 0.2823, 0.6816],
        [0.9152, 0.3971, 0.8742]], device='cuda:0')
DEVICE: 0
x.size() = torch.Size([4, 3])
DEVICE: 1
x.size() = torch.Size([4, 3])
x:
 tensor([[0.0223, 0.1689, 0.2939],
        [0.5185, 0.6977, 0.8000],
        [0.1610, 0.2823, 0.6816],
        [0.9152, 0.3971, 0.8742]], device='cuda:1')
x:
 tensor([[0.4963, 0.7682, 0.0885],
        [0.1320, 0.3074, 0.6341],
        [0.4901, 0.8964, 0.4556],
        [0.6323, 0.3489, 0.4017]], device='cuda:0')
DEVICE 0 PRNG STATUS BEFORE: 
 46790328.0
DEVICE 1 PRNG STATUS BEFORE: 
 46790328.0
DEVICE 0 PRNG STATUS AFTER: 
 46786488.0
DEVICE 1 PRNG STATUS AFTER: 
 46786488.0
DEVICE 0 DIFF PRNG STATUS ABS(BEFORE - AFTER).SUM(): 
 48.0
DEVICE 1 DIFF PRNG STATUS ABS(BEFORE - AFTER).SUM(): 
 48.0
Delta:
 tensor([[0.4194, 0.5529, 0.9527],
        [0.0362, 0.1852, 0.3734],
        [0.3051, 0.9320, 0.1759],
        [0.2698, 0.1507, 0.0317]], device='cuda:0')
Delta:
 tensor([[0.2081, 0.9298, 0.7231],
        [0.7423, 0.5263, 0.2437],
        [0.5846, 0.0332, 0.1387],
        [0.2422, 0.8155, 0.7932]], device='cuda:1')
X out:
 tensor([[0.9157, 1.3211, 1.0412],
        [0.1682, 0.4927, 1.0075],
        [0.7952, 1.8284, 0.6315],
        [0.9021, 0.4996, 0.4334],
        [0.2305, 1.0987, 1.0170],
        [1.2609, 1.2240, 1.0437],
        [0.7456, 0.3154, 0.8203],
        [1.1574, 1.2126, 1.6673]], device='cuda:0')

These are the only possible outcomes. You can see that there are only two unique deltas; each one is generated by either the first or the second thread, depending on which calls torch.rand() first. The signature of the PRNG state is not perfect, in the sense that two DIFFERENT PRNG states may have the same signature. It seems that the PRNG state is laid out in a specific way with particular properties; I did not spend much time looking for a perfectly unique signature.
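If you want a signature with practically no collisions (a sketch of my own; the code above does not use it), you can hash the raw bytes of the RNG state instead of summarizing it numerically:

import hashlib

import torch


def rng_signature():
    # Hash the raw bytes of the CPU RNG state; unlike the sum-based summary
    # used above, two different states will not collide in practice.
    state = torch.random.get_rng_state().numpy().tobytes()
    return hashlib.sha256(state).hexdigest()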

Imagine you call the forward function 100 times: predicting the output is impossible, since it is random. Now you understand why it is impossible to obtain reproducible results when using multithreading.

+++++++++++++++++++++++++++++++++++++++++++++++++++++++

                      PAGE 2/5

+++++++++++++++++++++++++++++++++++++++++++++++++++++++

The code above does not necessarily show that the threads share the random generator.
Now we will use a trick to show that: we will slow down thread 0 on purpose, and make thread 1 restore the PRNG state AFTER it has changed it. This strategy leads us to our next step: how to control randomness within multiple threads.

Code 2: threads share the random generator (continued): resetting the PRNG state.

import time
import random

import numpy as np

import torch
import torch.nn as nn
from torch.nn import DataParallel


def set_seed(seed):
    """
    Fix the seed for some modules.
    :param seed: int. The seed.
    :return:
    """
    torch.manual_seed(seed)
    np.random.seed(seed)
    random.seed(seed)
    torch.cuda.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.benchmark = False
    torch.backends.cudnn.deterministic = True


class Model(nn.Module):

    def __init__(self):
        super(Model, self).__init__()

    def forward(self, x):
        if x.is_cuda:
            device = x.get_device()
            print("DEVICE: {}".format(x.get_device()))
            print("x.size() = {}".format(x.size()))
            print("x:\n {}".format(x))
        else:
            device = torch.device("cpu")

        prngs0 = torch.random.get_rng_state().type(torch.float).numpy()
        # computing sum(abs(diff)**2) is a better way to have a summary of the PRNGs instead of computing only the sum.
        # Two different PRNGs may have the same sum (may happen if we generate few random numbers).
        print("DEVICE {} PRNG STATUS BEFORE: \n {}".format(torch.cuda.current_device(),
                                                           np.abs(np.diff(prngs0)**2).sum()))

        set_seed(1)
        if torch.cuda.current_device() == 0:
            print("Thread on device 0 is going to sleep ....")
            time.sleep(10)
        delta = torch.rand(x.size()).to(device)
        if torch.cuda.current_device() == 1:
            print("Thread on device 1 has changed back the PRNG status ....")
            set_seed(1)
        prngs1 = torch.random.get_rng_state().type(torch.float).numpy()
        print("DEVICE {} PRNG STATUS AFTER: \n {}".format(torch.cuda.current_device(),
                                                          np.abs(np.diff(prngs1)**2).sum()))

        print("DEVICE {} DIFF PRNG STATUS ABS(BEFORE - AFTER).SUM(): \n {}".format(
            torch.cuda.current_device(),   np.abs(prngs0 - prngs1).sum()))
        print("Delta:\n {}".format(delta))
        x = x + delta
        return x


if __name__ == "__main__":
    set_seed(0)
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    model = Model()
    model = DataParallel(model)
    model.to(device)

    x = torch.rand(8, 3)
    x = x.to(device)
    print("X in:\n {}".format(x))
    print("X out:\n {}".format(model(x)))

run:

X in:
 tensor([[0.4963, 0.7682, 0.0885],
        [0.1320, 0.3074, 0.6341],
        [0.4901, 0.8964, 0.4556],
        [0.6323, 0.3489, 0.4017],
        [0.0223, 0.1689, 0.2939],
        [0.5185, 0.6977, 0.8000],
        [0.1610, 0.2823, 0.6816],
        [0.9152, 0.3971, 0.8742]], device='cuda:0')
DEVICE: 0
DEVICE: 1
x.size() = torch.Size([4, 3])
x.size() = torch.Size([4, 3])
x:
 tensor([[0.0223, 0.1689, 0.2939],
        [0.5185, 0.6977, 0.8000],
        [0.1610, 0.2823, 0.6816],
        [0.9152, 0.3971, 0.8742]], device='cuda:1')
DEVICE 1 PRNG STATUS BEFORE: 
 46790328.0
Thread on device 1 has changed back the PRNG status ....
DEVICE 1 PRNG STATUS AFTER: 
 47205040.0
DEVICE 1 DIFF PRNG STATUS ABS(BEFORE - AFTER).SUM(): 
 213383.0
Delta:
 tensor([[0.7576, 0.2793, 0.4031],
        [0.7347, 0.0293, 0.7999],
        [0.3971, 0.7544, 0.5695],
        [0.4388, 0.6387, 0.5247]], device='cuda:1')
x:
 tensor([[0.4963, 0.7682, 0.0885],
        [0.1320, 0.3074, 0.6341],
        [0.4901, 0.8964, 0.4556],
        [0.6323, 0.3489, 0.4017]], device='cuda:0')
DEVICE 0 PRNG STATUS BEFORE: 
 47205040.0
Thread on device 0 is going to sleep ....
DEVICE 0 PRNG STATUS AFTER: 
 49006000.0
DEVICE 0 DIFF PRNG STATUS ABS(BEFORE - AFTER).SUM(): 
 212261.0
Delta:
 tensor([[0.7576, 0.2793, 0.4031],
        [0.7347, 0.0293, 0.7999],
        [0.3971, 0.7544, 0.5695],
        [0.4388, 0.6387, 0.5247]], device='cuda:0')
X out:
 tensor([[1.2539, 1.0475, 0.4915],
        [0.8667, 0.3367, 1.4339],
        [0.8872, 1.6508, 1.0251],
        [1.0711, 0.9876, 0.9264],
        [0.7800, 0.4482, 0.6970],
        [1.2532, 0.7269, 1.5999],
        [0.5582, 1.0366, 1.2511],
        [1.3540, 1.0358, 1.3988]], device='cuda:0')

The above code generates one and only one outcome state; in other words, it is deterministic. After thread 1 altered the PRNG state, it restored it (by re-seeding) BEFORE thread 0 made its call to the PRNG. Therefore, we know the state of the PRNG in advance, before running the code; hence the determinism.

So, in order to control the randomness, we need to control the state of the PRNG. Our strategy is re-seeding. To do that, we had to introduce some kind of ordering between the threads. Sending a thread to sleep is the worst way to do it (we did it only for illustration), and imposing an order on the scheduler may slow down execution, so this needs to be done carefully. Of course, we are not going to send our threads to sleep; instead, we will control access to the danger zone. We will rely on locks and seed control.

+++++++++++++++++++++++++++++++++++++++++++++++++++++++

                      PAGE 3/5

+++++++++++++++++++++++++++++++++++++++++++++++++++++++

To do that, we need to determine which parts of the code create randomness. Once identified, we wrap them in locks to synchronize access to them. Such regions need to be as small as possible so that we do not lock the entire code and make it sequential. Remember, locks are blocking, so use them wisely and over the smallest possible region of instructions.

Code 3: synchronize threads, using locks, to reset the PRNG state (thread-safely) as if nothing had happened.

import threading
import random

import numpy as np

import torch
import torch.nn as nn
from torch.nn import DataParallel


def set_seed(seed):
    """
    Fix the seed for some modules.
    :param seed: int. The seed.
    :return:
    """
    torch.manual_seed(seed)
    np.random.seed(seed)
    random.seed(seed)
    torch.cuda.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.benchmark = False
    torch.backends.cudnn.deterministic = True


class Model(nn.Module):

    def __init__(self):
        super(Model, self).__init__()
        self.thread_lock = threading.Lock()

    def forward(self, x):
        if x.is_cuda:
            device = x.get_device()
            print("DEVICE: {}".format(x.get_device()))
            print("x.size() = {}".format(x.size()))
            print("x:\n {}".format(x))
        else:
            device = torch.device("cpu")

        prngs0 = torch.random.get_rng_state().type(torch.float).numpy()
        # computing sum(abs(diff)**2) is a better way to have a summary of the PRNGs instead of computing only the sum.
        # Two different PRNGs may have the same sum (may happen if we generate few random numbers).
        print("DEVICE {} PRNG STATUS BEFORE: \n {}".format(torch.cuda.current_device(),
                                                           np.abs(np.diff(prngs0)**2).sum()))

        self.thread_lock.acquire()
        set_seed(1)  # re-seed
        delta = torch.rand(x.size()).to(device)  # the danger zone.
        set_seed(1)  # re-seed like nothing happened!!!
        self.thread_lock.release()

        prngs1 = torch.random.get_rng_state().type(torch.float).numpy()
        print("DEVICE {} PRNG STATUS AFTER: \n {}".format(torch.cuda.current_device(),
                                                          np.abs(np.diff(prngs1)**2).sum()))

        print("DEVICE {} DIFF PRNG STATUS ABS(BEFORE - AFTER).SUM(): \n {}".format(
            torch.cuda.current_device(),   np.abs(prngs0 - prngs1).sum()))
        print("Delta:\n {}".format(delta))
        x = x + delta
        return x


if __name__ == "__main__":
    set_seed(0)
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    model = Model()
    model = DataParallel(model)
    model.to(device)

    x = torch.rand(8, 3)
    x = x.to(device)
    print("X in:\n {}".format(x))
    print("X out:\n {}".format(model(x)))

run:

X in:
 tensor([[0.4963, 0.7682, 0.0885],
        [0.1320, 0.3074, 0.6341],
        [0.4901, 0.8964, 0.4556],
        [0.6323, 0.3489, 0.4017],
        [0.0223, 0.1689, 0.2939],
        [0.5185, 0.6977, 0.8000],
        [0.1610, 0.2823, 0.6816],
        [0.9152, 0.3971, 0.8742]], device='cuda:0')
DEVICE: 0
x.size() = torch.Size([4, 3])
DEVICE: 1
x.size() = torch.Size([4, 3])
x:
 tensor([[0.0223, 0.1689, 0.2939],
        [0.5185, 0.6977, 0.8000],
        [0.1610, 0.2823, 0.6816],
        [0.9152, 0.3971, 0.8742]], device='cuda:1')
DEVICE 1 PRNG STATUS BEFORE: 
 46790328.0
DEVICE 1 PRNG STATUS AFTER: 
 47205040.0
DEVICE 1 DIFF PRNG STATUS ABS(BEFORE - AFTER).SUM(): 
 213383.0
Delta:
 tensor([[0.7576, 0.2793, 0.4031],
        [0.7347, 0.0293, 0.7999],
        [0.3971, 0.7544, 0.5695],
        [0.4388, 0.6387, 0.5247]], device='cuda:1')
x:
 tensor([[0.4963, 0.7682, 0.0885],
        [0.1320, 0.3074, 0.6341],
        [0.4901, 0.8964, 0.4556],
        [0.6323, 0.3489, 0.4017]], device='cuda:0')
DEVICE 0 PRNG STATUS BEFORE: 
 47205040.0
DEVICE 0 PRNG STATUS AFTER: 
 47205040.0
DEVICE 0 DIFF PRNG STATUS ABS(BEFORE - AFTER).SUM(): 
 0.0
Delta:
 tensor([[0.7576, 0.2793, 0.4031],
        [0.7347, 0.0293, 0.7999],
        [0.3971, 0.7544, 0.5695],
        [0.4388, 0.6387, 0.5247]], device='cuda:0')
X out:
 tensor([[1.2539, 1.0475, 0.4915],
        [0.8667, 0.3367, 1.4339],
        [0.8872, 1.6508, 1.0251],
        [1.0711, 0.9876, 0.9264],
        [0.7800, 0.4482, 0.6970],
        [1.2532, 0.7269, 1.5999],
        [0.5582, 1.0366, 1.2511],
        [1.3540, 1.0358, 1.3988]], device='cuda:0')

The above code is deterministic: on each GPU, the PRNG generates the same random numbers.
Now, you may say that this is a good thing and that we can obtain reproducibility whatever the number of GPUs. Not so fast!!! We will run into an issue (see code 5). In short, random number generation is sequential.

OK, so now we manage to produce the same randomness on each GPU, which is fine. But you may not want that: you may want different BUT deterministic randomness on each GPU. To do that, we need to seed each thread differently. The code below provides a way of doing that.

+++++++++++++++++++++++++++++++++++++++++++++++++++++++

                      PAGE 4/5

+++++++++++++++++++++++++++++++++++++++++++++++++++++++

Code 4: Different deterministic randomness for each thread (GPU device).

In this case, each thread generates different randomness using its own seed, without affecting the other threads' PRNG state, thanks to the lock.

import threading
import random

import numpy as np

import torch
import torch.nn as nn
from torch.nn import DataParallel


def set_seed(seed):
    """
    Fix the seed for some modules.
    :param seed: int. The seed.
    :return:
    """
    torch.manual_seed(seed)
    np.random.seed(seed)
    random.seed(seed)
    torch.cuda.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.benchmark = False
    torch.backends.cudnn.deterministic = True


class Model(nn.Module):

    def __init__(self):
        super(Model, self).__init__()
        self.thread_lock = threading.Lock()

    def forward(self, x, seed):
        if x.is_cuda:
            device = x.get_device()
            print("DEVICE: {}".format(x.get_device()))
            print("x.size() = {}".format(x.size()))
            print("x:\n {}".format(x))
        else:
            device = torch.device("cpu")

        prngs0 = torch.random.get_rng_state().type(torch.float).numpy()
        # computing sum(abs(diff)**2) is a better way to have a summary of the PRNGs instead of computing only the sum.
        # Two different PRNGs may have the same sum (may happen if we generate few random numbers).
        print("DEVICE {} PRNG STATUS BEFORE: \n {}".format(torch.cuda.current_device(),
                                                           np.abs(np.diff(prngs0)**2).sum()))

        self.thread_lock.acquire()
        set_seed(seed)  # re-seed
        delta = torch.rand(x.size()).to(device)  # the danger zone.
        set_seed(seed)  # re-seed like nothing happened!!!
        self.thread_lock.release()

        prngs1 = torch.random.get_rng_state().type(torch.float).numpy()
        print("DEVICE {} PRNG STATUS AFTER: \n {}".format(torch.cuda.current_device(),
                                                          np.abs(np.diff(prngs1)**2).sum()))

        print("DEVICE {} DIFF PRNG STATUS ABS(BEFORE - AFTER).SUM(): \n {}".format(
            torch.cuda.current_device(),   np.abs(prngs0 - prngs1).sum()))
        print("Delta:\n {}".format(delta))
        x = x + delta
        return x


if __name__ == "__main__":
    set_seed(0)
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    model = Model()
    model = DataParallel(model)
    model.to(device)

    x = torch.rand(8, 3)
    x = x.to(device)
    print("X in:\n {}".format(x))
    print("X out:\n {}".format(model(x=x, seed=torch.tensor([10, 20]).to(device))))

run:

X in:
 tensor([[0.4963, 0.7682, 0.0885],
        [0.1320, 0.3074, 0.6341],
        [0.4901, 0.8964, 0.4556],
        [0.6323, 0.3489, 0.4017],
        [0.0223, 0.1689, 0.2939],
        [0.5185, 0.6977, 0.8000],
        [0.1610, 0.2823, 0.6816],
        [0.9152, 0.3971, 0.8742]], device='cuda:0')
DEVICE: 0
x.size() = torch.Size([4, 3])
DEVICE: 1
x.size() = torch.Size([4, 3])
x:
 tensor([[0.0223, 0.1689, 0.2939],
        [0.5185, 0.6977, 0.8000],
        [0.1610, 0.2823, 0.6816],
        [0.9152, 0.3971, 0.8742]], device='cuda:1')
DEVICE 1 PRNG STATUS BEFORE: 
 46790328.0
DEVICE 1 PRNG STATUS AFTER: 
 49251776.0
DEVICE 1 DIFF PRNG STATUS ABS(BEFORE - AFTER).SUM(): 
 217190.0
Delta:
 tensor([[0.5615, 0.1774, 0.8147],
        [0.3295, 0.2319, 0.7832],
        [0.8544, 0.1012, 0.1877],
        [0.9310, 0.0899, 0.3156]], device='cuda:1')
x:
 tensor([[0.4963, 0.7682, 0.0885],
        [0.1320, 0.3074, 0.6341],
        [0.4901, 0.8964, 0.4556],
        [0.6323, 0.3489, 0.4017]], device='cuda:0')
DEVICE 0 PRNG STATUS BEFORE: 
 49251776.0
DEVICE 0 PRNG STATUS AFTER: 
 47405664.0
DEVICE 0 DIFF PRNG STATUS ABS(BEFORE - AFTER).SUM(): 
 218221.0
Delta:
 tensor([[0.4581, 0.4829, 0.3125],
        [0.6150, 0.2139, 0.4118],
        [0.6938, 0.9693, 0.6178],
        [0.3304, 0.5479, 0.4440]], device='cuda:0')
X out:
 tensor([[0.9543, 1.2511, 0.4010],
        [0.7471, 0.5214, 1.0459],
        [1.1839, 1.8658, 1.0734],
        [0.9627, 0.8968, 0.8457],
        [0.5838, 0.3463, 1.1086],
        [0.8481, 0.9295, 1.5832],
        [1.0154, 0.3834, 0.8693],
        [1.8462, 0.4870, 1.1898]], device='cuda:0')

The above code is deterministic, and produces different randomness per device.

Now, using code 4, one can produce reproducible results across a fixed number of devices K.
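As a variant (a sketch of my own, assuming the set_seed helper defined above is in scope; reseed_per_device is a hypothetical name), one can derive the per-device seed from a base seed and the device index inside the locked region, instead of scattering an explicit seed tensor. Results remain reproducible only for a fixed K, since the sample-to-device mapping changes with K.

import torch


def reseed_per_device(base_seed):
    # Each DataParallel worker thread sees its own current device, so this
    # yields a distinct but deterministic seed per GPU:
    # device 0 -> base_seed, device 1 -> base_seed + 1, ...
    dev = torch.cuda.current_device() if torch.cuda.is_available() else 0
    set_seed(base_seed + dev)

# Inside forward, within the lock, one would then write:
#   reseed_per_device(10)
#   delta = torch.rand(x.size()).to(device)
#   reseed_per_device(10)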

+++++++++++++++++++++++++++++++++++++++++++++++++++++++

                      PAGE 5/5

+++++++++++++++++++++++++++++++++++++++++++++++++++++++

Now back to the issue raised after code 3. We show, with an illustrative example, that random number generation is done sequentially. This implies that, when generating random numbers for 4 samples, the 4th sample gets random values that would be different IF that same sample were processed on another device at position 0 in the mini-batch. This prevents reproducibility across ALL K.
Here, we look at random number generation on both CPU and GPU.

Code 5: random number generation is SEQUENTIAL, whether on CPU or GPU.

import threading
import random

import numpy as np

import torch
import torch.nn as nn
from torch.nn import DataParallel


def set_seed(seed):
    """
    Fix the seed for some modules.
    :param seed: int. The seed.
    :return:
    """
    torch.manual_seed(seed)
    np.random.seed(seed)
    random.seed(seed)
    torch.cuda.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.benchmark = False
    torch.backends.cudnn.deterministic = True


class Model(nn.Module):

    def __init__(self):
        super(Model, self).__init__()
        self.thread_lock = threading.Lock()

    def forward(self, x, seed):
        if x.is_cuda:
            device = x.get_device()
        else:
            device = torch.device("cpu")

        self.thread_lock.acquire()
        print("Generating random number on CPU on DEVICE {}".format(torch.cuda.current_device()))
        for i in range(1, x.size()[0], 1):
            set_seed(seed)  # re-seed
            delta = torch.rand(i)
            print("Random array {} at DEVICE {}: {}".format(i, torch.cuda.current_device(), delta))
            set_seed(seed)  # re-seed like nothing happened!!!

        print("Generating random number on GPU on DEVICE {}".format(torch.cuda.current_device()))
        for i in range(1, x.size()[0], 1):
            set_seed(seed)  # re-seed
            shape = torch.Size((i, 1))
            delta = torch.cuda.FloatTensor(shape)
            torch.rand(shape, out=delta)
            print("Random array {} at DEVICE {}: {}".format(i, torch.cuda.current_device(), delta.view(-1)))
            set_seed(seed)  # re-seed like nothing happened!!!
        self.thread_lock.release()

        return x


if __name__ == "__main__":
    set_seed(0)
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    model = Model()
    model = DataParallel(model)
    model.to(device)

    x = torch.rand(8, 3)
    x = x.to(device)
    model(x=x, seed=torch.tensor([10, 20]).to(device))

run:

Generating random number on CPU on DEVICE 0
Random array 1 at DEVICE 0: tensor([0.4581])
Random array 2 at DEVICE 0: tensor([0.4581, 0.4829])
Random array 3 at DEVICE 0: tensor([0.4581, 0.4829, 0.3125])
Generating random number on GPU on DEVICE 0
Random array 1 at DEVICE 0: tensor([0.2479], device='cuda:0')
Random array 2 at DEVICE 0: tensor([0.2479, 0.8615], device='cuda:0')
Random array 3 at DEVICE 0: tensor([0.2479, 0.8615, 0.0881], device='cuda:0')
Generating random number on CPU on DEVICE 1
Random array 1 at DEVICE 1: tensor([0.5615])
Random array 2 at DEVICE 1: tensor([0.5615, 0.1774])
Random array 3 at DEVICE 1: tensor([0.5615, 0.1774, 0.8147])
Generating random number on GPU on DEVICE 1
Random array 1 at DEVICE 1: tensor([0.8389], device='cuda:1')
Random array 2 at DEVICE 1: tensor([0.8389, 0.3764], device='cuda:1')
Random array 3 at DEVICE 1: tensor([0.8389, 0.3764, 0.6292], device='cuda:1')

You see that numbers are generated sequentially.

You may ask why the numbers differ between CPU and GPU. I do not have an explanation; I expected to get the same numbers. MY GUESS is that it has something to do with the sampling algorithm: a different algorithm may be used depending on whether the sampling is done on CPU or GPU [check THRandom.cpp]. PyTorch does not guarantee reproducibility between CPU and GPU.

The consequence of the previous note is that if you use, for instance, dropout, the mask generated for a sample depends on which device it lands on and on its position in the mini-batch. If we fix the same seed on every device, the mask depends only on the sample's position within the chunk of the mini-batch assigned to its device, and this position changes when the number of devices K changes. This makes reproducibility across all K impossible, even with our re-seeding strategy. Now, you may ask: what if we fix a seed per SAMPLE, so as to be independent of any device? Conceptually, that would solve the problem. However, in practice it may slow down the computation, since you would have to process samples sequentially: with dropout, for instance, you would have to forward each sample separately in order to fix its own seed. This is not practical at all.
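For illustration only (a sketch of my own, not something I recommend in practice; per_sample_dropout is a hypothetical helper), here is what a per-sample-seeded dropout would look like; the Python-level loop over samples makes the sequential cost obvious.

import torch
import torch.nn.functional as F


def per_sample_dropout(batch, sample_seeds, p=0.5):
    # Illustrative only: one re-seed and one RNG call per sample forces a
    # Python-level loop over the mini-batch, which is exactly the sequential
    # cost discussed above.
    outs = []
    for sample, seed in zip(batch, sample_seeds):
        torch.manual_seed(int(seed))  # mask tied to the sample's seed, not to the device
        outs.append(F.dropout(sample.unsqueeze(0), p=p, training=True))
    return torch.cat(outs, dim=0)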

Last notes/reminders:

  1. If you want to make your code reproducible for a fixed number of GPUs K, you can use a seed per thread, combined with locks. Lock the smallest possible set of instructions that produce randomness: if a model produces randomness, do not lock the entire call to the model; go inside the model and lock only the instructions that generate randomness.
  2. If BatchNorm is used, you need to handle it so as not to degrade performance, e.g., by using synchronized BatchNorm across all GPUs.
  3. Reproducibility in Pytorch still needs a lot of work.