DataLoader is not deterministic

Hi,

I am using DataLoader and, even with a single worker, I get a different image order on every run. How can I make it generate the same image order every time? (And does this depend on the number of workers?)

Code is initialized with:

random.seed(1)
torch.manual_seed(1)
torch.backends.cudnn.deterministic = True

In addition, I would like to be able to serialize the DataLoader’s internal state to a file, so that if I stop the run in the middle of an epoch and resume it later, I can continue from the same place it stopped.

Thank you,
Moshe


The DataLoader should provide the same random ordering as long as it is properly seeded and there are no race conditions.
This example gives the same ordering for num_workers=0 or num_workers=1.

import torch
from torch.utils.data import Dataset, DataLoader

torch.manual_seed(2809)
torch.backends.cudnn.deterministic = True  # not necessary in this example

class MyDataset(Dataset):
    def __init__(self):
        self.data = torch.randn(25, 1)
        
    def __getitem__(self, index):
        print('Index: ', index)
        return self.data[index]

    def __len__(self):
        return len(self.data)


dataset = MyDataset()
loader = DataLoader(dataset,
                    batch_size=5,
                    shuffle=True,
                    num_workers=0,
                    pin_memory=True)

for batch_idx, data in enumerate(loader):
    data = data.to('cuda')

If you use a higher number of workers, the order of the samples might differ.
At least I observe this effect on my machine; I assume it is due to race conditions between the worker processes.
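If reproducibility with multiple workers matters to you, one common pattern (just a sketch, not part of the example above) is to seed each worker explicitly via worker_init_fn. This does not change the order in which batches are returned (that is drawn by the sampler in the main process), but it makes the random augmentations inside each worker reproducible:

import random
import numpy as np
import torch
from torch.utils.data import DataLoader

def worker_init_fn(worker_id):
    # torch.initial_seed() inside a worker is base_seed + worker_id,
    # so each worker gets a distinct but reproducible seed
    worker_seed = torch.initial_seed() % 2**32
    random.seed(worker_seed)
    np.random.seed(worker_seed)

loader = DataLoader(dataset,          # reusing the dataset from above
                    batch_size=5,
                    shuffle=True,
                    num_workers=4,
                    worker_init_fn=worker_init_fn)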


Hi,

Thanks for the code. Your example indeed reproduces the numbers in the same order over different trials.

In my code, however, which is very similar in principle to this example, I get different behavior even with a single worker. I don’t have a clue where to start looking for the source of this problem.

I would appreciate any advice.

Please note that I also asked about saving the state of the DataLoader so that I can stop and resume a run in the middle of an epoch… If you can, please address that part as well.

Thank you,
Moshe

Do you call any other random functions from another library, like numpy?
If possible, I would remove unnecessary parts of the code and check whether a minimal example produces deterministic behavior.

Regarding your second question:
@albanD posted a suggestion here

Hi,

Thanks… I traced the problem to the creation and iteration of an (unordered) dictionary, which led to different results on every run. After switching to OrderedDict, things work as expected. I wasn’t aware that there is any randomness involved in a standard dictionary (I thought the keys were unordered, but not in a random sense).
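For reference, a small sketch of the kind of change I mean (the labels here are just a made-up example; the point is that iterating over a set, or an ordinary dict on older Python versions, can give a different order on every run):

from collections import OrderedDict

labels = {'dog', 'cat', 'bird'}  # set iteration order can change between runs

# order depends on hash randomization -> not reproducible
class_to_idx = {name: i for i, name in enumerate(labels)}

# fix the order explicitly before enumerating -> reproducible
class_to_idx = OrderedDict((name, i) for i, name in enumerate(sorted(labels)))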

Regarding my second question, I went over the linked code, but I would still like to use RandomSampler. If I understand correctly, it is sufficient to record two variables: the RNG state when creating the iterator over the DataLoader (since this is when the random permutation is drawn) and the number of “next” requests it has received so far. Is it correct that these two variables capture the state of the DataLoader?

Thanks,
-Moshe

Nice that you’ve found the problematic part!

I think you are basically right. The use case would be a bit more complicated if you use more workers.
Also, how are you going to stop the DataLoader? Do you want to stop it with CTRL+C?
If so, you have to take care of stopping all workers, since I’ve quite often seen zombie Python processes left behind by stopped DataLoaders.

I’m not sure this will yield a clean solution.
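For the single-worker case, a minimal sketch of the idea you describe might look like this (assuming num_workers=0 and the default RandomSampler; the variable names are mine):

import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(25, 1))
loader = DataLoader(dataset, batch_size=5, shuffle=True, num_workers=0)

torch.manual_seed(2809)
rng_state = torch.get_rng_state()   # save the RNG state *before* creating the iterator
loader_iter = iter(loader)          # RandomSampler draws its permutation here
batches_done = 0
for _ in range(2):                  # consume two batches, then "stop" the run
    next(loader_iter)
    batches_done += 1

# ... later, to resume: restore the RNG state and skip the batches already seen
torch.set_rng_state(rng_state)
loader_iter = iter(loader)
for _ in range(batches_done):
    next(loader_iter)
# next(loader_iter) now yields the same batch the original run would have produced next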

Why does torch.backends.cudnn.deterministic = True make my code so slow? I am using a Titan Xp. Without the flag, the code runs fast. I am using cudnn 7, cuda 9.1, and pytorch 0.4.

Yes, cudnn.deterministic=True trades speed for determinism.
If you really need deterministic behavior, there isn’t another option.
Usually it’s fine to leave it disabled and also set torch.backends.cudnn.benchmark = True to gain a bit more speed.
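Roughly, the trade-off looks like this (just to spell it out):

import torch

# reproducible results, but cudnn may fall back to slower algorithms
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

# faster, but cudnn may pick non-deterministic algorithms
torch.backends.cudnn.deterministic = False
torch.backends.cudnn.benchmark = True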

Thanks. I am training a network from scratch on my own dataset. My aim is to reproduce the results on each run. However, I cannot reproduce them (the losses are different when I rerun the code). These are my settings:

random.seed(1234) #  np.randint()
torch.manual_seed(1234)
cudnn.enabled = True
cudnn.benchmark=True

Do you have any suggestions?

How large is the difference in the results?

The first epochs of the first run look like this:

0 | 1.78229 
1 | 1.31438 
2 | 1.06264 
3 | 0.95527 
4 | 1.22681 
5 | 0.87360 
6 | 0.84937 
7 | 0.82182 
8 | 0.73692

The second run is:

0 | 1.65189 
1 | 1.30213 
2 | 1.23979 
3 | 1.01453 
4 | 0.94315 
5 | 0.78658 
6 | 0.88853 
7 | 0.71684 
8 | 0.64936

With cudnn.benchmark=False and cudnn.deterministic=True, are the values identical?
The difference seems to be quite large.

Sorry, I have updated the results above. The difference also looks too large. I will test your suggestion.

This is the result based on your setting:

  0 | 1.63084 
  1 | 1.26171 
  2 | 1.00780 
  3 | 0.87280

And

    0 | 1.62280 
    1 | 1.32109 
    2 | 1.16098

But the problem is that with deterministic=True, one epoch takes a lot of time (about 100x slower).

I just wanted to see whether the results are equal in such a case, which doesn’t seem to be so.
Maybe the difference comes from some other effect, like the dict/OrderedDict issue?

How can I check whether I used it? As far as I remember, I did not.

Could you try to get deterministic results by setting the seeds, setting cudnn to deterministic, etc.?
This would exclude possible effects such as an unknown order in Python dicts, as in @MosheM’s case.
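Something along these lines (a sketch; np.random.seed is only needed if your code actually uses numpy randomness):

import random
import numpy as np
import torch

seed = 1234
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)
torch.cuda.manual_seed_all(seed)   # relevant for multi-GPU runs
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False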

Hi @ptrblck, I wasn’t able to obtain the same random ordering from your example in pytorch>=1.1.0.

Here’s the notebook. Is there a way to make the dataloader deterministic in newer pytorch versions?
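(For what it’s worth, in more recent releases it also seems possible to pass an explicitly seeded torch.Generator to the DataLoader so that the shuffling is driven by that generator rather than the global RNG — something like the sketch below, though I haven’t checked in which version the argument was introduced:)

g = torch.Generator()
g.manual_seed(2809)

loader = DataLoader(dataset,
                    batch_size=5,
                    shuffle=True,
                    num_workers=0,
                    generator=g)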