DataLoader is not deterministic

Hi,

I am using DataLoader and, even with a single worker, I get a different image order on every run. How can I make it generate the same image order every time? (And does this depend on the number of workers?)

Code is initialized with:

random.seed(1)
torch.manual_seed(1)
torch.backends.cudnn.deterministic = True

In addition, I would like to be able to serialize the DataLoader’s internal state to a file, so that if I stop the run in the middle of an epoch and resume it later, I can continue from the same place it stopped.

Thank you,
Moshe


The DataLoader should provide the same random ordering as long as it is properly seeded and there are no race conditions.
This example gives the same ordering for num_workers=0 or num_workers=1.

import torch
from torch.utils.data import Dataset, DataLoader

torch.manual_seed(2809)
torch.backends.cudnn.deterministic = True  # not necessary in this example

class MyDataset(Dataset):
    def __init__(self):
        self.data = torch.randn(25, 1)
        
    def __getitem__(self, index):
        print('Index: ', index)
        return self.data[index]

    def __len__(self):
        return len(self.data)


dataset = MyDataset()
loader = DataLoader(dataset,
                    batch_size=5,
                    shuffle=True,
                    num_workers=0,
                    pin_memory=True)

for batch_idx, data in enumerate(loader):
    data = data.to('cuda')

If you use a higher number of workers, the order of the samples might differ.
At least I observe this effect on my machine; I assume it is due to race conditions between the worker processes.
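If reproducibility with multiple workers matters to you, one common pattern (just a sketch, not part of the example above) is to seed each worker explicitly via worker_init_fn. This does not change the order in which batches are returned (that is drawn by the sampler in the main process), but it makes the random augmentations inside each worker reproducible:

import random
import numpy as np
import torch
from torch.utils.data import DataLoader

def worker_init_fn(worker_id):
    # torch.initial_seed() inside a worker is base_seed + worker_id,
    # so each worker gets a distinct but reproducible seed
    worker_seed = torch.initial_seed() % 2**32
    random.seed(worker_seed)
    np.random.seed(worker_seed)

loader = DataLoader(dataset,          # reusing the dataset from above
                    batch_size=5,
                    shuffle=True,
                    num_workers=4,
                    worker_init_fn=worker_init_fn)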


Hi,

Thanks for the code. Your example indeed reproduces the numbers in the same order over different trials.

In my code, however, which is very similar in principle to this example, I get different behavior even with a single worker. I don’t have a clue where to start looking for the source of this problem.

I would appreciate any advice.

Please note that I also asked about saving the state of the DataLoader so that I can stop and resume a run in the middle of an epoch… If you can, please address that part as well.

Thank you,
Moshe

Do you call any other random functions from another library, like numpy?
If possible, I would remove unnecessary parts of the code and check whether a minimal example produces deterministic behavior.

Regarding your second question:
@albanD posted a suggestion here

Hi,

Thanks… I traced the problem to the creation and iteration of an (unordered) dictionary, which led to different results on every run. After switching to OrderedDict, things work as expected. I wasn’t aware that there is any randomness involved in a standard dictionary (I thought the keys were unordered, but not in a random sense).
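For reference, a small sketch of the kind of change I mean (the labels here are just a made-up example; the point is that iterating over a set, or an ordinary dict on older Python versions, can give a different order on every run):

from collections import OrderedDict

labels = {'dog', 'cat', 'bird'}  # set iteration order can change between runs

# order depends on hash randomization -> not reproducible
class_to_idx = {name: i for i, name in enumerate(labels)}

# fix the order explicitly before enumerating -> reproducible
class_to_idx = OrderedDict((name, i) for i, name in enumerate(sorted(labels)))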

Regarding my second question, I went over the linked code, but I would still like to use RandomSampler. If I understand correctly, it is sufficient to record two variables: the RNG state when creating the iterator over the DataLoader (since this is when the random permutation is drawn) and the number of “next” requests it has received so far. Is it correct that these two variables capture the state of the DataLoader?

Thanks,
-Moshe

Nice that you’ve found the problematic part!

I think you are basically right. The use case would be a bit more complicated if you use more workers.
Also, how are you going to stop the DataLoader? Do you want to stop it with CTRL+C?
If so, you have to take care of stopping all workers, since I’ve quite often seen zombie Python processes left behind by stopped DataLoaders.

I’m not sure this will yield a clean solution.
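For the single-worker case, a minimal sketch of the idea you describe might look like this (assuming num_workers=0 and the default RandomSampler; the variable names are mine):

import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(25, 1))
loader = DataLoader(dataset, batch_size=5, shuffle=True, num_workers=0)

torch.manual_seed(2809)
rng_state = torch.get_rng_state()   # save the RNG state *before* creating the iterator
loader_iter = iter(loader)          # RandomSampler draws its permutation here
batches_done = 0
for _ in range(2):                  # consume two batches, then "stop" the run
    next(loader_iter)
    batches_done += 1

# ... later, to resume: restore the RNG state and skip the batches already seen
torch.set_rng_state(rng_state)
loader_iter = iter(loader)
for _ in range(batches_done):
    next(loader_iter)
# next(loader_iter) now yields the same batch the original run would have produced next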

Why does torch.backends.cudnn.deterministic = True make my code so slow? I am using a Titan Xp. Without the flag, the code runs fast. I am using cudnn 7, cuda 9.1, and pytorch 0.4.

Yes, cudnn.deterministic=True trades speed for determinism.
If you really need deterministic behavior, there isn’t another option.
Usually it’s fine to leave it disabled and also set torch.backends.cudnn.benchmark = True to gain a bit more speed.
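Roughly, the trade-off looks like this (just to spell it out):

import torch

# reproducible results, but cudnn may fall back to slower algorithms
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

# faster, but cudnn may pick non-deterministic algorithms
torch.backends.cudnn.deterministic = False
torch.backends.cudnn.benchmark = True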

Thanks. I am training a network from scratch on my own dataset. My aim is to reproduce the results on each run. However, I cannot reproduce them (the losses are different when I rerun the code). These are my settings:

random.seed(1234) #  np.randint()
torch.manual_seed(1234)
cudnn.enabled = True
cudnn.benchmark=True

Do you have any suggestions?

How large is the difference in the results?

The first epochs of the first run look like this:

0 | 1.78229 
1 | 1.31438 
2 | 1.06264 
3 | 0.95527 
4 | 1.22681 
5 | 0.87360 
6 | 0.84937 
7 | 0.82182 
8 | 0.73692

The second run is:

0 | 1.65189 
1 | 1.30213 
2 | 1.23979 
3 | 1.01453 
4 | 0.94315 
5 | 0.78658 
6 | 0.88853 
7 | 0.71684 
8 | 0.64936

With cudnn.benchmark=False and cudnn.deterministic=True, are the values identical?
The difference seems to be quite large.

Sorry, I have updated the results above. The difference also looks too large. I will test your suggestion.

This is the result based on your setting:

  0 | 1.63084 
  1 | 1.26171 
  2 | 1.00780 
  3 | 0.87280

And

    0 | 1.62280 
    1 | 1.32109 
    2 | 1.16098

But the problem is that with deterministic=True, one epoch takes a lot of time (about 100x slower).

I just wanted to see whether the results are equal in such a case, which doesn’t seem to be so.
Maybe the difference comes from some other effect, like the dict/OrderedDict issue?

How can I check whether I used it? As far as I remember, I did not.

Could you try to get deterministic results by setting the seeds, setting cudnn to deterministic, etc.?
This would exclude possible effects such as an unknown order in Python dicts, as in @MosheM’s case.
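Something along these lines (a sketch; np.random.seed is only needed if your code actually uses numpy randomness):

import random
import numpy as np
import torch

seed = 1234
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)
torch.cuda.manual_seed_all(seed)   # relevant for multi-GPU runs
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False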

Hi @ptrblck, I wasn’t able to obtain the same random ordering from your example in pytorch>=1.1.0.

Here’s the notebook. Is there a way to make the dataloader deterministic in newer pytorch versions?
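(For what it’s worth, in more recent releases it also seems possible to pass an explicitly seeded torch.Generator to the DataLoader so that the shuffling is driven by that generator rather than the global RNG — something like the sketch below, though I haven’t checked in which version the argument was introduced:)

g = torch.Generator()
g.manual_seed(2809)

loader = DataLoader(dataset,
                    batch_size=5,
                    shuffle=True,
                    num_workers=0,
                    generator=g)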